Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: multi-type SparseDataFrame fixes and improvements #13917

Closed
wants to merge 15 commits into from
Closed

BUG: multi-type SparseDataFrame fixes and improvements #13917

wants to merge 15 commits into from

Conversation

sstanovnik
Copy link
Contributor

@sstanovnik sstanovnik commented Aug 5, 2016

  • 1 tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

Types were incorrectly determined when slicing SparseDataFrames with
multiple dtypes (such as float and object) into SparseSeries.

import pandas as pd
sdf = pd.SparseDataFrame([[1, 2, 'a'], [4, 5, 'b']])
sdf.iloc[[0]]  # OK
sdf.iloc[0]  # ValueError: could not convert string to float: 'a'

No existing issue covers this.

@codecov-io
Copy link

codecov-io commented Aug 5, 2016

Current coverage is 85.30% (diff: 100%)

Merging #13917 into master will not change coverage

@@             master     #13917   diff @@
==========================================
  Files           139        139          
  Lines         50157      50157          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42785      42785          
  Misses         7372       7372          
  Partials          0          0          

Powered by Codecov. Last update b7abef4...8c7d1ea

@@ -4440,7 +4440,10 @@ def _lcd_dtype(l):
""" find the lowest dtype that can accomodate the given types """
m = l[0].dtype
for x in l[1:]:
if x.dtype.itemsize > m.itemsize:
# the new dtype must either be wider or a strict subtype
if (x.dtype.itemsize > m.itemsize or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't np.find_common_type do this? (or .common_type)?

should create a pandas version of this to isolate (e.g. this won't handle non-numpy types)

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Aug 5, 2016
@jreback
Copy link
Contributor

jreback commented Aug 5, 2016

@sinhrks

@sinhrks
Copy link
Member

sinhrks commented Aug 5, 2016

as sparse mainly supports float64, adding more tests for mixed dtype is appreciated (maybe in the same file as separate class).

@sstanovnik sstanovnik changed the title BUG: slicing single rows of multi-type SparseDataFrames. BUG: multi-type SparseDataFrame fixes and improvements Aug 8, 2016
@sstanovnik
Copy link
Contributor Author

I used numpy's find_common_type instead of that local function, this changed a test (test_block_internals.py), see if that numpy behaviour is desired.

Another (maybe non-minor) change is the default parameter in sparse/array.py, but that didn't seem to really break anything.

I also added some new tests as suggested, but don't feel confident in adding the proposed pandas alternative to the numpy find_common_type - my knowledge of pandas dtypes isn't really great. It would likely be similar to _interleaved_dtype in core/internals.py right?

def _lcd_dtype(l):
""" find the lowest dtype that can accomodate the given types """
m = l[0].dtype
for x in l[1:]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant was can start off by simply moving this (your version or new one with np.find_common_type to pandas.types.cast (and if we have specific tests move similarly; if not, ideally add some). we can add the pandas specific functionaility later.

@sinhrks
Copy link
Member

sinhrks commented Aug 8, 2016

can u also add tests to check normal SparseArray and its dtypes (related to the default change)?

current default (float64) is because it doesn't fully support int and bool, should be solved by #13849.

@sstanovnik
Copy link
Contributor Author

It turns out this PR works without the default argument change, I was too hasty to change it. Your PR fixes that better, so I reverted the change.

Common type discovery moved to types/cast.py:_find_common_type. I did notice, however, that types/common.py:_lcd_dtypes exists (seems to only work with numpy dtypes, but differently than np.find_common_type), but changing either implementation to the other broke some things, so I left it as it is.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2016

yep, need to fix this.

@sstanovnik can you create a new issue for this.

[jreback-~/pandas] grin _lcd_dtype pandas
pandas/core/frame.py:
   47 :                                  _lcd_dtypes,
 3705 :                 new_dtype = _lcd_dtypes(this_dtype, other_dtype)
pandas/core/internals.py:
 4438 :     def _lcd_dtype(l):
 4473 :         lcd = _lcd_dtype(counts[IntBlock])
 4489 :         return _lcd_dtype(counts[FloatBlock] + counts[SparseBlock])
pandas/types/common.py:
  389 : def _lcd_dtypes(a_dtype, b_dtype):


values = self.mixed_int.as_matrix(['A', 'D'])
self.assertEqual(values.dtype, np.int64)

# guess all ints are cast to uints....
# B uint64 forces float because there are other signed int types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might fix another bug, can you search for uint64 issues and see?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue as a reference here
and in the whatsnew

@sstanovnik sstanovnik mentioned this pull request Aug 9, 2016
2 tasks
@sstanovnik
Copy link
Contributor Author

Opened issue. Moved the tests, added new tests. Found and processed #10364.

@@ -188,6 +191,42 @@ def test_possibly_convert_objects_copy(self):
self.assertTrue(values is not out)


class TestCommonTypes(tm.TestCase):
def setUp(self):
super(TestCommonTypes, self).setUp()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need setUp here

self.assertEqual(_find_common_type([np.object]), np.object)

self.assertEqual(_find_common_type([np.int16, np.int64]),
np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group them

eg ints

floats

easier to read

@@ -437,6 +437,7 @@ API changes
- ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`)
- ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`)
- ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`)
- ``.values`` will now return ``np.float64`` with a ``DataFrame`` with ``np.int64`` and ``np.uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.values will now ....with a frame of mixed int64 and uint64 dtypes.....

@@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
- Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`)
- Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
- Bug in single row slicing on multi-type ``SparseDataFrame``s: types were previously forced to float (:issue:`13917`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, types where previously...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads fine to me? There's a colon after SparseDataFrames

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change : to ,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait sorry I misread your comment.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2016

lgtm. @sinhrks ?

@jreback jreback added this to the 0.19.0 milestone Aug 9, 2016
@sstanovnik
Copy link
Contributor Author

Thanks for your patience.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2016

ha! thanks for yours

@sinhrks
Copy link
Member

sinhrks commented Aug 9, 2016

lgtm, thx @sstanovnik ! find_common_type is needed for #13849 also :)

@jreback jreback closed this in 0e7ae89 Aug 10, 2016
@jreback
Copy link
Contributor

jreback commented Aug 10, 2016

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants