BUG: multi-type SparseDataFrame fixes and improvements #13917

sstanovnik · 2016-08-05T12:28:58Z

1 tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Types were incorrectly determined when slicing SparseDataFrames with
multiple dtypes (such as float and object) into SparseSeries.

import pandas as pd
sdf = pd.SparseDataFrame([[1, 2, 'a'], [4, 5, 'b']])
sdf.iloc[[0]]  # OK
sdf.iloc[0]  # ValueError: could not convert string to float: 'a'

No existing issue covers this.

codecov-io · 2016-08-05T16:07:13Z

Current coverage is 85.30% (diff: 100%)

Merging #13917 into master will not change coverage

@@             master     #13917   diff @@
==========================================
  Files           139        139          
  Lines         50157      50157          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42785      42785          
  Misses         7372       7372          
  Partials          0          0

Powered by Codecov. Last update b7abef4...8c7d1ea

jreback · 2016-08-05T16:36:30Z

pandas/core/internals.py

@@ -4440,7 +4440,10 @@ def _lcd_dtype(l):
        """ find the lowest dtype that can accomodate the given types """
        m = l[0].dtype
        for x in l[1:]:
-            if x.dtype.itemsize > m.itemsize:
+            # the new dtype must either be wider or a strict subtype
+            if (x.dtype.itemsize > m.itemsize or


doesn't np.find_common_type do this? (or .common_type)?

should create a pandas version of this to isolate (e.g. this won't handle non-numpy types)
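
To illustrate the question above, here is a minimal sketch (not the PR's code) of why an itemsize-only rule misses the float/object case. `lcd_by_itemsize` is a hypothetical, simplified re-creation of the original loop that takes dtypes directly rather than blocks. Because `np.find_common_type` is deprecated or removed in recent NumPy releases, the sketch assumes `np.result_type` as a stand-in for NumPy's promotion rules.

```python
import numpy as np


def lcd_by_itemsize(dtypes):
    """Hypothetical re-creation of the itemsize-only rule: keep the widest dtype."""
    m = dtypes[0]
    for d in dtypes[1:]:
        # object is never wider than float64 in bytes, so it never wins here
        if d.itemsize > m.itemsize:
            m = d
    return m


float_and_object = [np.dtype("float64"), np.dtype("object")]

# Itemsize comparison keeps float64, which is what made the string column fail
by_size = lcd_by_itemsize(float_and_object)

# NumPy's promotion rules fall back to object for this pair
by_promotion = np.result_type(*float_and_object)

print(by_size, by_promotion)
```

Neither approach knows about pandas-specific dtypes such as categoricals, which is the gap the second sentence of the comment points at.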

jreback · 2016-08-05T16:36:46Z

@sinhrks

sinhrks · 2016-08-05T20:15:19Z

as sparse mainly supports float64, adding more tests for mixed dtypes is appreciated (maybe in the same file as a separate class).

Types were incorrectly determined when slicing SparseDataFrames with multiple dtypes (such as float and object). Also enables type inference for SparseArrays by default.

sstanovnik · 2016-08-08T19:46:03Z

I used numpy's find_common_type instead of that local function. This changed a test (test_block_internals.py); please check whether that numpy behaviour is desired.

Another (maybe non-minor) change is the default parameter in sparse/array.py, but that didn't seem to really break anything.

I also added some new tests as suggested, but I don't feel confident adding the proposed pandas alternative to the numpy find_common_type - my knowledge of pandas dtypes isn't really great. It would likely be similar to _interleaved_dtype in core/internals.py, right?

jreback · 2016-08-08T19:50:28Z

pandas/core/internals.py

-    def _lcd_dtype(l):
-        """ find the lowest dtype that can accomodate the given types """
-        m = l[0].dtype
-        for x in l[1:]:


what I meant was: you can start off by simply moving this (your version, or a new one using np.find_common_type) to pandas.types.cast (and if we have specific tests, move them similarly; if not, ideally add some). we can add the pandas-specific functionality later.

sinhrks · 2016-08-08T21:06:12Z

can u also add tests to check normal SparseArray and its dtypes (related to the default change)?

the current default (float64) is there because sparse doesn't fully support int and bool; that should be solved by #13849.

sstanovnik · 2016-08-09T10:14:19Z

It turns out this PR works without the default argument change, I was too hasty to change it. Your PR fixes that better, so I reverted the change.

Common type discovery moved to types/cast.py:_find_common_type. I did notice, however, that types/common.py:_lcd_dtypes exists (seems to only work with numpy dtypes, but differently than np.find_common_type), but changing either implementation to the other broke some things, so I left it as it is.

jreback · 2016-08-09T10:48:24Z

yep, need to fix this.

@sstanovnik can you create a new issue for this?

[jreback-~/pandas] grin _lcd_dtype pandas
pandas/core/frame.py:
   47 :                                  _lcd_dtypes,
 3705 :                 new_dtype = _lcd_dtypes(this_dtype, other_dtype)
pandas/core/internals.py:
 4438 :     def _lcd_dtype(l):
 4473 :         lcd = _lcd_dtype(counts[IntBlock])
 4489 :         return _lcd_dtype(counts[FloatBlock] + counts[SparseBlock])
pandas/types/common.py:
  389 : def _lcd_dtypes(a_dtype, b_dtype):

jreback · 2016-08-09T10:49:33Z

pandas/tests/frame/test_block_internals.py


        values = self.mixed_int.as_matrix(['A', 'D'])
        self.assertEqual(values.dtype, np.int64)

-        # guess all ints are cast to uints....
+        # B uint64 forces float because there are other signed int types


this might fix another bug, can you search for uint64 issues and see?

add the issue as a reference here
and in the whatsnew
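
The behaviour change discussed in this review can be sketched with plain NumPy. This is an illustration under the assumption that `np.result_type` and array stacking follow the same promotion rules the PR relies on; it does not call the pandas code touched by the diff.

```python
import numpy as np

# No integer dtype holds both int64 and uint64 ranges, so NumPy promotes to float64
signed_and_unsigned = np.result_type(np.int64, np.uint64)

# Narrower mixed-sign integers still fit into a wider signed integer
small_mixed = np.result_type(np.int8, np.uint8)

# All-signed integers simply widen
all_signed = np.result_type(np.int32, np.int64)

# Stacking columns into one 2-D array applies the same promotion
a = np.array([1, 2], dtype=np.int64)
b = np.array([3, 4], dtype=np.uint64)
stacked = np.column_stack([a, b])

print(signed_and_unsigned, small_mixed, all_signed, stacked.dtype)
```

One consequence worth keeping in mind: float64 cannot represent every 64-bit integer exactly, so very large values may lose precision after such a conversion.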

sstanovnik · 2016-08-09T13:15:03Z

Opened issue. Moved the tests, added new tests. Found and processed #10364.

jreback · 2016-08-09T13:17:43Z

pandas/tests/types/test_cast.py

@@ -188,6 +191,42 @@ def test_possibly_convert_objects_copy(self):
        self.assertTrue(values is not out)


+class TestCommonTypes(tm.TestCase):
+    def setUp(self):
+        super(TestCommonTypes, self).setUp()


you don't need setUp here

jreback · 2016-08-09T13:18:17Z

pandas/tests/types/test_cast.py

+        self.assertEqual(_find_common_type([np.object]), np.object)
+
+        self.assertEqual(_find_common_type([np.int16, np.int64]),
+                         np.int64)


group them

eg ints

floats

easier to read
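
A sketch of what the two review suggestions (no `setUp`, assertions grouped by kind) might look like. `find_common_type` here is a hypothetical stand-in built on `np.result_type`, not the private `_find_common_type` helper from the PR, and the chosen cases are illustrative only.

```python
import unittest

import numpy as np


def find_common_type(types):
    """Hypothetical stand-in for the PR's helper, built on np.result_type."""
    return np.result_type(*[np.dtype(t) for t in types])


class TestCommonTypes(unittest.TestCase):
    # No setUp: the tests share no fixtures

    def test_object(self):
        self.assertEqual(find_common_type([object]), np.dtype(object))
        self.assertEqual(find_common_type([object, np.float64]), np.dtype(object))

    def test_ints(self):
        self.assertEqual(find_common_type([np.int64]), np.dtype("int64"))
        self.assertEqual(find_common_type([np.int16, np.int64]), np.dtype("int64"))
        self.assertEqual(find_common_type([np.uint16, np.uint64]), np.dtype("uint64"))

    def test_floats(self):
        self.assertEqual(find_common_type([np.float16, np.float32]), np.dtype("float32"))
        self.assertEqual(find_common_type([np.float32, np.float64]), np.dtype("float64"))
        self.assertEqual(find_common_type([np.float32, np.int16]), np.dtype("float32"))

    def test_mixed_signedness(self):
        self.assertEqual(find_common_type([np.int32, np.uint32]), np.dtype("int64"))
        self.assertEqual(find_common_type([np.int64, np.uint64]), np.dtype("float64"))
```

The class can be collected by any unittest-compatible runner; grouping by kind makes it easier to see which promotion rule a failing assertion belongs to.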

jreback · 2016-08-09T21:56:33Z

doc/source/whatsnew/v0.19.0.txt

@@ -437,6 +437,7 @@ API changes
 - ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`)
 - ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`)
 - ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`)
+- ``.values`` will now return ``np.float64`` with a ``DataFrame`` with ``np.int64`` and ``np.uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`)


DataFrame.values will now ....with a frame of mixed int64 and uint64 dtypes.....

jreback · 2016-08-09T22:25:35Z

doc/source/whatsnew/v0.19.0.txt

@@ -764,6 +765,7 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
 - Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value``  (:issue:`13866`)
 - Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
 - Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
+- Bug in single row slicing on multi-type ``SparseDataFrame``s: types were previously forced to float (:issue:`13917`)


, types were previously...

This reads fine to me? There's a colon after SparseDataFrames

change : to ,

Oh wait sorry I misread your comment.

jreback · 2016-08-09T22:26:53Z

lgtm. @sinhrks ?

sstanovnik · 2016-08-09T22:32:18Z

Thanks for your patience.

jreback · 2016-08-09T22:36:08Z

ha! thanks for yours

sinhrks · 2016-08-09T22:37:40Z

lgtm, thx @sstanovnik ! find_common_type is needed for #13849 also :)

jreback · 2016-08-10T10:42:54Z

thanks!

jreback reviewed Aug 5, 2016

jreback added the Indexing and Dtype Conversions labels Aug 5, 2016

sstanovnik added 6 commits August 8, 2016 21:33

BUG: multi-type sparse slicing fixes and improvements

2e833fa

Types were incorrectly determined when slicing SparseDataFrames with multiple dtypes (such as float and object). Also enables type inference for SparseArrays by default.

Add a whatsnew note.

fb6237c

Use numpy to determine common dtypes.

c7fb0f2

Infer dtype instead of forcing float in SparseArray.

114217e

Additional multitype tests.

33973a5

Modified the whatsnew message.

93d2de6

sstanovnik changed the title ~~BUG: slicing single rows of multi-type SparseDataFrames.~~ BUG: multi-type SparseDataFrame fixes and improvements Aug 8, 2016

jreback reviewed Aug 8, 2016

sstanovnik added 2 commits August 9, 2016 12:08

Revert default argument change.

6782bc7

Factor the common type discovery to an internal function.

2104948

jreback reviewed Aug 9, 2016

sstanovnik mentioned this pull request Aug 9, 2016

Unified common dtype discovery #13947

Closed

2 tasks

sstanovnik added 3 commits August 9, 2016 14:48

Modify .values docs to process issue #10364.

ac790d7

Moved multitype tests to sparse/tests/test_multitype.py

eebcb23

Add tests for common dtypes, raises check for pandas ones.

8d675ad

jreback reviewed Aug 9, 2016

sstanovnik added 2 commits August 9, 2016 15:32

Whatsnew, issue tag, test reordering.

442b8c1

Fix a derp.

926ca1e

jreback reviewed Aug 9, 2016

Wording and code organization fixes.

057d56b

jreback reviewed Aug 9, 2016

jreback added this to the 0.19.0 milestone Aug 9, 2016

Colon to comma.

8c7d1ea

jreback closed this in 0e7ae89 Aug 10, 2016

sinhrks mentioned this pull request Aug 20, 2016

ENH: Sparse int64 and bool dtype support enhancement #13849

Merged

4 tasks

petehuang mentioned this pull request Dec 28, 2016

BUG/DOC: DataFrame.values return type when uint64 is mixed with signed int types #10364

Closed
