BUG: Segfault when printing dataframe #46848

Closed
3 tasks done
hrenski opened this issue Apr 23, 2022 · 15 comments
Labels: Bug; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Output-Formatting (__repr__ of pandas objects, to_string); Regression (functionality that used to work in a prior pandas version); Segfault (non-recoverable error)

Comments

@hrenski

hrenski commented Apr 23, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame(np.empty((1, 33), dtype = object))

for col in df.columns:
    df[col] = np.empty_like(df[col])
    
print(df)

Issue Description

I first wanted to put out a thank you to all those that work on keeping pandas going! All of your contributions have helped so many, and I appreciate it.

When running the example code, a segfault occurs at the last line: print(df). I'm sure this wouldn't be the recommended approach to doing this, but I wanted to give a minimal example showing the issues that I had come across. Also, note that replacing df[col] = np.empty_like(df[col]) with df[col] = np.empty(len(df), dtype = df[col].dtype) prevents the issue from arising. But, based on the output of faulthandler (see below), I opted to submit it to the pandas team.

Fatal Python error: Segmentation fault

Current thread 0x00007fabaf0ba180 (most recent call first):
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/array_algos/take.py", line 163 in _take_nd_ndarray
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/array_algos/take.py", line 117 in take_nd
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 1137 in take_nd
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 844 in _slice_take_blocks_ax0
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/generic.py", line 3916 in _slice
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1533 in _get_slice_axis
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1497 in _getitem_axis
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 827 in _getitem_tuple_same_dim
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1462 in _getitem_tuple
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 961 in __getitem__
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 806 in _truncate_horizontally
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 790 in truncate
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 185 in _fit_strcols_to_terminal_width
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 49 in _get_string_representation
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 25 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1128 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/frame.py", line 1192 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/frame.py", line 1011 in __repr__
File "/home/me/test.py", line 13 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.strptime, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json (total: 54)
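For anyone wanting to capture a similar trace: the dump above is in the format written by Python's standard-library faulthandler module. The following is only a minimal sketch and not necessarily how the report was produced; the helper name capture_traceback_dump is made up for illustration. It writes the current traceback on demand to a temporary file rather than waiting for a crash.

```python
import faulthandler
import tempfile


def capture_traceback_dump() -> str:
    """Return the current Python traceback in faulthandler's dump format.

    Hypothetical helper for illustration: faulthandler writes straight to
    the file descriptor, so a real (temporary) file is used, not StringIO.
    """
    with tempfile.TemporaryFile(mode="w+") as log:
        # Same kind of dump that is emitted on a fatal signal, but on demand.
        faulthandler.dump_traceback(file=log, all_threads=True)
        log.seek(0)
        return log.read()


dump = capture_traceback_dump()
print(dump)
```

In a real script, calling faulthandler.enable() near the top, or running the script with python -X faulthandler, makes the same kind of dump appear on stderr when a fatal signal such as SIGSEGV arrives.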

Expected Behavior

I would expect the dataframe would be able to print after updating the columns.

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.72-microsoft-standard-WSL2
Version : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@hrenski hrenski added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 23, 2022
@hrenski hrenski changed the title BUG: BUG: Segfault when printing dataframe Apr 23, 2022
@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 23, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Apr 23, 2022
@rhshadrach
Member

rhshadrach commented Apr 23, 2022

Thanks for the report! Interestingly I am getting print(df.to_string()) to work. Further investigations and PRs to fix are most welcome.

@hrenski
Author

hrenski commented Apr 23, 2022

A little additional info: After some further testing, I've found that the number of columns needed to trigger the segfault depends on the size of the terminal. Expanding my terminal window to full screen allows 33 columns to be printed, but increasing the number of columns to 50 again produces a segfault.

As @rhshadrach mentioned, print(df.to_string()) works ok. But using print(str(df)) triggers the error.

Maybe the segfault occurs during the internal methods that create a "nice" representation of the dataframe; if that formatting isn't applied (e.g. to_string or if the dataframe "fits" in the terminal), then the error is bypassed.
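The difference between the truncated repr and to_string can be shown without relying on the terminal size. This is a rough sketch under assumptions: it uses a harmless frame of zeros rather than the crashing object-dtype frame, and it caps display.max_columns at 4 (an arbitrary value chosen here) to force the truncated layout that a narrow terminal would otherwise produce.

```python
import numpy as np
import pandas as pd

# A harmless 1x33 float frame; NOT the object-dtype frame from the report.
df = pd.DataFrame(np.zeros((1, 33)))

# Capping display.max_columns forces the truncated layout ("..." column)
# regardless of how wide the terminal happens to be.
with pd.option_context("display.max_columns", 4):
    truncated = repr(df)

# to_string() renders every column by default, so no truncation happens.
full = df.to_string()

print(truncated)
print("..." in truncated)
print("..." in full)
```

If the hypothesis above is right, only the first call goes through the truncation logic, which would explain why to_string sidesteps the crash.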

I also wonder if this isn't somehow coupled with the state of the dataframe after the columns are updated via __setitem__. If I replace the for loop

for col in df.columns:
    df[col] = np.empty_like(df[col])

with

df = df.apply(lambda s: np.empty_like(s))

then print(df) works with no problem.

@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Apr 24, 2022
@roberthdevries
Contributor

take

@roberthdevries
Contributor

I have a fix that prevents the segfault, but I am not sure how to write a regression test, because the failure depends on the width of the terminal. I don't consider that a workable basis for a regression test, but I am at a loss as to how to trigger the issue otherwise.
As far as I can determine, the crash only happens when printing a dataframe requires dots to indicate omitted columns because the screen is not wide enough.
Apparently there is some logic that detects this situation and then takes another code path. That code path takes care of injecting the dots, but it also triggers the code that produces the segfault.
Ideally, we could write a regression test that provokes the segfaulting code path without going through the print logic.
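One possible shape for such a check, offered only as a sketch: run the reproducer in a child interpreter so that a crash cannot take down the test process, and pin display.max_columns so the result does not depend on the terminal width. The value 4 and the names CHILD_SCRIPT and run_reproducer are assumptions made for illustration, and it is assumed (not verified here) that capping max_columns reaches the same truncation path as a narrow terminal. Whether the child actually crashes depends on the installed pandas, NumPy and Cython versions.

```python
import subprocess
import sys
import textwrap

# Reproducer from the report, with the column cap pinned (assumed value)
# so the truncated repr is requested independently of the terminal size.
CHILD_SCRIPT = textwrap.dedent(
    """
    import numpy as np
    import pandas as pd

    pd.set_option("display.max_columns", 4)
    df = pd.DataFrame(np.empty((1, 33), dtype=object))
    for col in df.columns:
        df[col] = np.empty_like(df[col])
    repr(df)
    """
)


def run_reproducer() -> subprocess.CompletedProcess:
    """Run the reproducer in a separate interpreter with faulthandler on."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        timeout=120,
    )


result = run_reproducer()
# On POSIX, a negative return code means the child was killed by a signal.
crashed = result.returncode < 0
print("crashed:", crashed)
```

A test built this way would assert that crashed is False; on Windows the return-code convention differs, so the check would need adjusting there.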

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 26, 2022
@simonjayhawkins
Member

I also wonder if this isn't somehow coupled with the state of the dataframe after the columns are updated via __setitem__

appears so.

first bad commit: [03dd698] BUG: DataFrame.setitem sometimes operating inplace (#43406)

cc @jbrockmendel

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label May 26, 2022
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.4.3 May 26, 2022
@jbrockmendel
Member

Yep, definitely weird. Slightly simpler reproducer:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.empty((1, 1), dtype = object))
df[0] = np.empty_like(df[0])
df[[0]]

If the fix in #47097 is correct, then it may be something afoot in cython.

@simonjayhawkins
Member

Slightly simpler reproducer:

It looks like the underlying issue has been present since at least 0.25.3.

@simonjayhawkins
Member

last python code executed

func(arr, indexer, out, fill_value)

-> func(arr, indexer, out, fill_value)
(Pdb) a
arr = array([[None]], dtype=object)
indexer = array([0])
axis = 1
fill_value = None
allow_fill = False
(Pdb) pp locals()
{'allow_fill': False,
 'arr': array([[None]], dtype=object),
 'axis': 1,
 'dtype': dtype('O'),
 'fill_value': None,
 'flip_order': True,
 'func': <built-in function take_2d_axis1_object_object>,
 'indexer': array([0]),
 'mask_info': (None, False),
 'out': array([[None]], dtype=object),
 'out_shape': (1, 1),
 'out_shape_': [1, 1]}
(Pdb) 

@simonjayhawkins
Member

If the fix in #47097 is correct, then it may be something afoot in cython.

or maybe numpy?

import numpy as np
import pandas as pd
from pandas._libs.algos import take_2d_axis1_object_object

print(pd.__version__)

arr = np.array([[None]], dtype=object)
print(arr)

indexer = np.array([0])
print(indexer)

out = np.array([[None]], dtype=object)
print(out)

take_2d_axis1_object_object(arr, indexer, out, None)
print(out)

arr2 = np.empty_like(arr)
print(arr2)

print(np.array_equal(arr, arr2))

take_2d_axis1_object_object(arr2, indexer, out, None)
print(out)

[[None]]
[0]
[[None]]
[[None]]
[[None]]
True
Segmentation fault (core dumped)

@simonjayhawkins
Member

If the fix in #47097 is correct, then it may be something afoot in cython.

FWIW, I don't think that fix is working consistently. It appears to work for the code sample above, which calls take_2d_axis1_object_object directly.

using a code sample to include the ordering logic of _take_nd_ndarray

import numpy as np
from pandas.core.array_algos.take import _take_nd_ndarray

arr = np.array([[None]], dtype=object)
indexer = np.array([0])
axis = 1
fill_value = None
allow_fill = False

result = _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
print(result)

arr2 = np.empty_like(arr)
result2 = _take_nd_ndarray(arr2, indexer, axis, fill_value, allow_fill)
print(result2)

still segfaults with the patch.

@simonjayhawkins
Member

Maybe related:

#13717
numpy/numpy#7854
pygeos/pygeos#373

from pygeos/pygeos#374

Apparently, Numpy treats NULL the same as None in object-dtype arrays.
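A small NumPy-only illustration of that point, as a sketch rather than a diagnosis: from Python, an object array created with np.empty_like or np.empty is indistinguishable from one explicitly filled with None, which is consistent with the np.array_equal result shown earlier in this thread. Any difference in the underlying storage would only be visible to compiled code reading the raw pointers, and that part is an assumption not demonstrated here.

```python
import numpy as np

# Explicitly filled with None.
arr = np.array([[None]], dtype=object)

# Allocated without an explicit fill value.
arr2 = np.empty_like(arr)
arr3 = np.empty((1, 1), dtype=object)

# At the Python level, every element reads back as None.
print(arr2[0, 0] is None)
print(arr3[0, 0] is None)

# Element-wise comparison cannot tell the arrays apart either.
print(np.array_equal(arr, arr2))
print(np.array_equal(arr, arr3))
```

This only shows the Python-level view; it says nothing about which of the arrays is safe to pass to a given Cython routine.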

@roberthdevries
Contributor

roberthdevries commented Jun 9, 2022

using a code sample to include the ordering logic of _take_nd_ndarray

Indeed, this time the issue pops up in take_2d_axis0_object_object (as opposed to take_2d_axis1_object_object), however a similar fix can be applied there as well.

@jorisvandenbossche
Member

Can this issue be closed now that the Cython minimum version has been bumped? (#47979)

@simonjayhawkins
Member

see #47979 (comment), will close once MacPython/pandas-wheels#188 merged.

@simonjayhawkins
Member

closed by MacPython/pandas-wheels#188

6 participants