BUG: Segfault when printing dataframe #46848

hrenski · 2022-04-23T02:11:36Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame(np.empty((1, 33), dtype = object))

for col in df.columns:
    df[col] = np.empty_like(df[col])
    
print(df)

Issue Description

I first wanted to put out a thank you to all those that work on keeping pandas going! All of your contributions have helped so many, and I appreciate it.

When running the example code, a segfault occurs at the last line: print(df). I'm sure this wouldn't be the recommended approach to doing this, but I wanted to give a minimal example showing the issues that I had come across. Also, note that replacing df[col] = np.empty_like(df[col]) with df[col] = np.empty(len(df), dtype = df[col].dtype) prevents the issue from arising. But, based on the output of faulthandler (see below), I opted to submit it to the pandas team.

Fatal Python error: Segmentation fault

Current thread 0x00007fabaf0ba180 (most recent call first):
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/array_algos/take.py", line 163 in _take_nd_ndarray
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/array_algos/take.py", line 117 in take_nd
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 1137 in take_nd
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 844 in _slice_take_blocks_ax0
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/generic.py", line 3916 in _slice
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1533 in _get_slice_axis
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1497 in _getitem_axis
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 827 in _getitem_tuple_same_dim
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 1462 in _getitem_tuple
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/indexing.py", line 961 in __getitem__
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 806 in _truncate_horizontally
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 790 in truncate
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 185 in _fit_strcols_to_terminal_width
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 49 in _get_string_representation
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/string.py", line 25 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1128 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/frame.py", line 1192 in to_string
File "/home/me/sml-env/lib/python3.10/site-packages/pandas/core/frame.py", line 1011 in __repr__
File "/home/me/test.py", line 13 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.strptime, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json (total: 54)
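
The trace above is in the format written by Python's standard faulthandler module. The report does not show how it was enabled, so the following is only a minimal sketch of one way to get such a trace, assuming the handler is installed at the top of the script before the crashing code runs:

```python
import faulthandler

# Install handlers for fatal signals such as SIGSEGV. If the interpreter
# crashes, the Python-level stack of each thread is written to stderr.
faulthandler.enable()

# Confirm the handler is active before running the code under suspicion.
print(faulthandler.is_enabled())
```

The same effect is available without editing the script by running it as `python -X faulthandler test.py` or by setting the `PYTHONFAULTHANDLER` environment variable.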

Expected Behavior

I would expect the dataframe would be able to print after updating the columns.

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.72-microsoft-standard-WSL2
Version : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


rhshadrach · 2022-04-23T15:11:58Z

Thanks for the report! Interestingly I am getting print(df.to_string()) to work. Further investigations and PRs to fix are most welcome.

hrenski · 2022-04-23T17:18:35Z

A little additional info: after some further testing, I've found that the number of columns needed to trigger the segfault depends on the size of the terminal. Expanding my terminal window to full screen allows 33 columns to be printed, but increasing the number of columns to 50 again produces a segfault.

As @rhshadrach mentioned, print(df.to_string()) works ok. But using print(str(df)) triggers the error.

Maybe the segfault occurs during the internal methods that create a "nice" representation of the dataframe; if that formatting isn't applied (e.g. to_string or if the dataframe "fits" in the terminal), then the error is bypassed.
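
The difference between the two rendering paths can be seen without relying on the real terminal width. The sketch below is illustrative only: it uses a harmless float frame rather than the object-dtype frame from the report, and it assumes that setting `display.max_columns` to a small value is enough to make `repr` truncate horizontally, which is the path shown in the traceback.

```python
import numpy as np
import pandas as pd

# A harmless 1 x 33 frame of float zeros (not the frame from the report).
df = pd.DataFrame(np.zeros((1, 33)))

# to_string() with no arguments renders every column, so nothing is truncated.
full = df.to_string()

# A small column budget forces repr() to drop the middle columns and insert
# a "..." placeholder, independent of the actual terminal size.
with pd.option_context("display.max_columns", 4, "display.width", 80):
    truncated = repr(df)

print("..." in full)
print("..." in truncated)
```

Only the second rendering has to slice the frame by column position, which would be consistent with `to_string()` working while `str(df)` crashes.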

I also wonder if this isn't somehow coupled with the state of the dataframe after the columns are updated via __setitem__. If I replace the for loop

for col in df.columns:
    df[col] = np.empty_like(df[col])

with

df = df.apply(lambda s: np.empty_like(s))

then print(df) works with no problem.

roberthdevries · 2022-05-23T17:21:08Z

take

roberthdevries · 2022-05-23T17:27:15Z

I have a fix that prevents the segfault, but I am not sure how a regression test could be made as the failure depends on the width of the terminal screen. I deem this not a workable regression test, but I am at a loss how to trigger the issue otherwise.
As far as I can determine the crash only happens when printing a dataframe requires the use of dots to indicate missing values when the screen is not wide enough.
Apparently there is some logic to determine this situation, and then take another code path. This other code path then takes care of the injection of the dots, but also triggers the code path that produces the segfault.
Ideally we can make a regression test that provokes the code path that produces the segfault without the print logic.

simonjayhawkins · 2022-05-26T13:02:16Z

I also wonder if this isn't somehow coupled with the state of the dataframe after the columns are updated via __setitem__

appears so.

first bad commit: [03dd698] BUG: DataFrame.__setitem__ sometimes operating inplace (#43406)

cc @jbrockmendel

jbrockmendel · 2022-05-26T15:00:36Z

Yep, definitely weird. Slightly simpler reproducer:

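import numpy as np
import pandas as pd

df = pd.DataFrame(np.empty((1, 1), dtype=object))
df[0] = np.empty_like(df[0])
df[[0]]  # column-list selection; this is the line that segfaults on affected builds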

If the fix in #47097 is correct, then it may be something afoot in cython.
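
Because this shorter reproducer needs no printing, it could in principle back a regression test that does not depend on terminal width. A crash would still take the test runner down with it, so one option is to run the snippet in a child interpreter and inspect the exit status. The sketch below is an assumption about how that might look, not how the pandas test suite does it; `run_isolated` and `died_from_segfault` are hypothetical helper names.

```python
import signal
import subprocess
import sys

# The reproducer from this thread, kept as a string so that it only ever
# runs inside a child interpreter.
REPRODUCER = """
import numpy as np
import pandas as pd

df = pd.DataFrame(np.empty((1, 1), dtype=object))
df[0] = np.empty_like(df[0])
df[[0]]
"""


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return proc.returncode


def died_from_segfault(returncode: int) -> bool:
    """On POSIX, death by signal N is reported as a return code of -N."""
    return returncode == -signal.SIGSEGV
```

A test could then assert that `run_isolated(REPRODUCER)` returns 0, and use `died_from_segfault` to give a clearer failure message when it does not.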

simonjayhawkins · 2022-05-26T18:35:19Z

Slightly simpler reproducer:

It looks like this underlying problem has existed since at least 0.25.3.

simonjayhawkins · 2022-06-06T11:08:20Z

last python code executed

pandas/pandas/core/array_algos/take.py

Line 163 in c0b659a

func(arr, indexer, out, fill_value)

-> func(arr, indexer, out, fill_value)
(Pdb) a
arr = array([[None]], dtype=object)
indexer = array([0])
axis = 1
fill_value = None
allow_fill = False
(Pdb) pp locals()
{'allow_fill': False,
 'arr': array([[None]], dtype=object),
 'axis': 1,
 'dtype': dtype('O'),
 'fill_value': None,
 'flip_order': True,
 'func': <built-in function take_2d_axis1_object_object>,
 'indexer': array([0]),
 'mask_info': (None, False),
 'out': array([[None]], dtype=object),
 'out_shape': (1, 1),
 'out_shape_': [1, 1]}
(Pdb)

simonjayhawkins · 2022-06-06T12:04:34Z

If the fix in #47097 is correct, then it may be something afoot in cython.

or maybe numpy?

import numpy as np
import pandas as pd
from pandas._libs.algos import take_2d_axis1_object_object

print(pd.__version__)

arr = np.array([[None]], dtype=object)
print(arr)

indexer = np.array([0])
print(indexer)

out = np.array([[None]], dtype=object)
print(out)

take_2d_axis1_object_object(arr, indexer, out, None)
print(out)

arr2 = np.empty_like(arr)
print(arr2)

print(np.array_equal(arr, arr2))

take_2d_axis1_object_object(arr2, indexer, out, None)
print(out)

[[None]]
[0]
[[None]]
[[None]]
[[None]]
True
Segmentation fault (core dumped)

simonjayhawkins · 2022-06-06T12:57:09Z

If the fix in #47097 is correct, then it may be something afoot in cython.

FWIW I don't think that fix is working consistently. It appears to work for the code sample above calling take_2d_axis1_object_object directly

using a code sample to include the ordering logic of _take_nd_ndarray

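import numpy as np
from pandas.core.array_algos.take import _take_nd_ndarray

arr = np.array([[None]], dtype=object)
indexer = np.array([0])
axis = 1
fill_value = None
allow_fill = False

result = _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
print(result)

# Same call on an uninitialised object array; this is the one that crashes.
arr2 = np.empty_like(arr)
result2 = _take_nd_ndarray(arr2, indexer, axis, fill_value, allow_fill)
print(result2)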

still segfaults with the patch.

simonjayhawkins · 2022-06-06T13:38:30Z

Maybe related:

#13717
numpy/numpy#7854
pygeos/pygeos#373

from pygeos/pygeos#374

Apparently, Numpy treats NULL the same as None in object-dtype arrays.
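
That quote matches what the earlier sample printed: from Python, an array made by `np.empty_like` cannot be told apart from one that really holds `None`. The sketch below uses only NumPy and shows the two side by side. Building the array with `np.full` is an alternative introduced here for illustration; whether it avoids the crash on affected builds was not tested in this thread.

```python
import numpy as np

arr = np.array([[None]], dtype=object)

# empty_like makes no promise about contents; per the quote above, NumPy
# reads such object slots back as None.
uninitialised = np.empty_like(arr)

# Illustrative alternative: store an actual None reference in every slot.
explicit = np.full(arr.shape, None, dtype=object)

print(uninitialised[0, 0] is None)
print(np.array_equal(uninitialised, explicit))
```

Both checks pass at the Python level, which is why the difference only shows up in compiled code that touches the raw object pointers.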

roberthdevries · 2022-06-09T10:34:04Z

using a code sample to include the ordering logic of _take_nd_ndarray

Indeed, this time the issue pops up in take_2d_axis0_object_object (as opposed to take_2d_axis1_object_object), however a similar fix can be applied there as well.

jorisvandenbossche · 2022-08-12T09:20:56Z

Can this issue be closed now the cython minimum version is bumped? (#47979)

simonjayhawkins · 2022-08-12T09:37:10Z

see #47979 (comment), will close once MacPython/pandas-wheels#188 merged.

simonjayhawkins · 2022-08-16T10:57:47Z

closed by MacPython/pandas-wheels#188

hrenski added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Apr 23, 2022

hrenski changed the title from "BUG:" to "BUG: Segfault when printing dataframe" Apr 23, 2022

rhshadrach added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and Output-Formatting (__repr__ of pandas objects, to_string) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Apr 23, 2022

rhshadrach added this to the Contributions Welcome milestone Apr 23, 2022

jbrockmendel added the Segfault (non-recoverable error) label Apr 24, 2022

github-actions bot assigned roberthdevries May 23, 2022

roberthdevries mentioned this issue May 23, 2022

Fix segfault with printing dataframe #47097

Closed

4 tasks

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 26, 2022

code sample for pandas-dev#46848

a689227

simonjayhawkins added the Regression (functionality that used to work in a prior pandas version) label May 26, 2022

simonjayhawkins modified the milestones: Contributions Welcome, 1.4.3 May 26, 2022

simonjayhawkins mentioned this issue Jun 8, 2022

RLS: 1.4.3 #46610

Closed

simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 21, 2022

simonjayhawkins mentioned this issue Aug 5, 2022

BUILD: Bump Cython again to (at least) 0.29.31 #47978

Closed

This was referenced Aug 5, 2022

DEPS: Update cython #47979

Merged

DEPS: Update cython MacPython/pandas-wheels#188

Merged

simonjayhawkins closed this as completed Aug 16, 2022
