In-memory to_csv compression #22555

Closed · ZaxR opened this issue Aug 31, 2018 · 11 comments · Fixed by #35129
Labels: Enhancement, IO CSV (read_csv, to_csv)

Comments

ZaxR commented Aug 31, 2018

Code Sample, a copy-pastable example if possible

# Attempt 1
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
# RuntimeWarning: compression has no effect when passing file-like object as input.
type(test)
# Out: str

# Attempt 2
from io import BytesIO

b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
# Out: TypeError: a bytes-like object is required, not 'str'

Problem description

I am trying to gzip-compress a DataFrame in memory (as opposed to writing it directly to a named file location). The use case is, I imagine, similar to the reason to_csv now allows omitting the path in other cases to create an in-memory representation; specifically, I need to save the compressed DataFrame to a cloud location using a custom URI, so I'm temporarily keeping it in memory for that purpose.

Expected Output

I would expect the compression option to result in a compressed bytes object (similar to what the gzip library produces).
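
Concretely, something like this sketch captures what I'm hoping for (hypothetical until the enhancement lands; on 0.23.4 it raises the TypeError shown above):

from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
buf = BytesIO()
df.to_csv(buf, compression="gzip")  # would write gzipped CSV bytes into buf
gz_bytes = buf.getvalue()           # compressed bytes, ready to upload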

Thank you in advance for your help!

Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought it might have fixed this as well, but it looks like it stopped just short of covering my case.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd added the "IO CSV (read_csv, to_csv)" label Aug 31, 2018
gfyoung (Member) commented Sep 1, 2018

Interesting proposal! As a (temporary) workaround, could you not save to disk and then read into memory by any chance?

BTW, if you have ideas on how to implement in-memory compression, go for it!
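
For example, a rough sketch of that temporary-file round trip (the tempfile usage here is just illustrative):

import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
    path = tmp.name
df.to_csv(path, compression="gzip")  # compression works for named paths
with open(path, "rb") as f:
    gz_bytes = f.read()  # the gzipped CSV is now in memory
os.remove(path)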

ZaxR (Author) commented Sep 6, 2018

Hey, thanks for the reply @gfyoung, and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (I can't be sure what the end user's local environment will look like).

My thought was something like this as a workaround:

import gzip
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
b_buf = BytesIO()
with gzip.open(b_buf, 'wb') as f:
    f.write(df.to_csv().encode())  # write the CSV text as gzipped bytes

gz_bytes = b_buf.getvalue()  # the compressed CSV, still in memory

gfyoung (Member) commented Sep 6, 2018

@ZaxR : Gotcha. Interesting...I think it would be a solid enhancement nonetheless. Would be open to proposals for implementation.

dhimmel (Contributor) commented Sep 21, 2018

I agree that the compression argument should take effect when a file-like object is passed. This enhancement would likely include implementing zip writing support in _get_handle, which would address the frustration I had in #22011 (comment).
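
As a point of reference, the stdlib side of in-memory zip writing is straightforward; the open question is wiring it into _get_handle (this sketch shows only the zipfile part):

import zipfile
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
buf = BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", df.to_csv(index=False))
zip_bytes = buf.getvalue()  # a complete zip archive, entirely in memory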

silverdrake11 commented Oct 13, 2018

For the 'gzip' compression, _get_handle() is not being called when BytesIO() is passed. This causes it to fail at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py.

If _get_handle() is called on BytesIO(), then what happens is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails because GzipFile opens the fileobj read-only by default. Setting the mode explicitly works: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode))), as in the sketch below.
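
A standalone sketch of that mode fix (assuming mode='wb'; the wrapper chain is csv.writer -> TextIOWrapper -> GzipFile -> BytesIO):

import csv
import gzip
from io import BytesIO, TextIOWrapper

b_buf = BytesIO()
with gzip.GzipFile(fileobj=b_buf, mode='wb') as gz:
    with TextIOWrapper(gz, encoding='utf-8', newline='') as text:
        writer = csv.writer(text)
        writer.writerow(['A', 'B'])
        writer.writerow([1, 2])
gz_bytes = b_buf.getvalue()  # gzipped CSV bytes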

The 'bz2' compression fix is the same. 'xz' will not compress BytesIO() unless LZMACompressor is used. 'zip' has the custom workflow referenced by dhimmel, which complicates it further.

There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

In order to actually call _get_handle() on BytesIO(), the elif hasattr(self.path_or_buf, 'write') in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.

For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is because CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None (which effectively means 'ascii').

This is the test code I was using:

from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
hello = BytesIO()
test = df.to_csv(hello, compression='gzip')
print(hello.getvalue())

dhimmel (Contributor) commented Oct 15, 2018

> There is too much logic in _get_handle, and it is called many times for reading and for writing.

I agree. Especially since there is no docstring to define what the function intends to support.

> One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

I agree it may be helpful to split read/write and 2/3. However, with 2019 only a couple of months away, the purging of Python 2 from pandas is just around the corner. It seems like any major changes should plan for a 3-only codebase (and hence benefit from the great simplification)?

gfyoung (Member) commented Oct 15, 2018

@dhimmel : I agree that the implementation should be 3-oriented. If it's 2-oriented as well, great! If not, I would still write it out but just hold off on the PR until the turn of the calendar year.

ZaxR (Author) commented Feb 8, 2019

@dhimmel and @gfyoung - Now that we've reached the new year/pandas release, any update on this enhancement? Would really love this capability :)

gfyoung (Member) commented Feb 8, 2019

@ZaxR: Thanks for the ping! We're still in the process of releasing versions that are Python-2-compatible, so we might want to hold off on this a little longer. That being said, proposals for a pure Python-3-compatible implementation would be great 👍

MarcSkovMadsen commented Feb 6, 2020

I was also looking for the compression functionality in order to produce base64-encoded download links.

import base64

import pandas as pd


def _to_base64_encoded_link(data: pd.DataFrame) -> str:
    csv = data.to_csv(index=False)
    # some str <-> bytes conversions are necessary here
    b64 = base64.b64encode(csv.encode()).decode()
    return f'<a href="data:file/csv;base64,{b64}" download="data.csv">Download</a>'

Currently my dataframes are too big, so I would like to compress them.
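
For instance, a sketch of the same link builder with gzip applied before the base64 step (gzip.compress is the stdlib one-shot API; the function name and MIME type here are just illustrative):

import base64
import gzip

import pandas as pd


def _to_gzipped_base64_link(data: pd.DataFrame) -> str:
    gz = gzip.compress(data.to_csv(index=False).encode())
    b64 = base64.b64encode(gz).decode()
    return f'<a href="data:application/gzip;base64,{b64}" download="data.csv.gz">Download</a>'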

fberanizo added a commit to platiagro/sdk that referenced this issue Apr 17, 2020
* Moves column from metadata to body

Max metadata size is 8kb and became a problem when the dataset has
lots of columns.

* Removes redundant parameter

encoding='utf-8' is the default value

* New metadata storage solution

MinIO/S3 object metadata has lots of problems:
max 8kb size, allows only Dict[str, str], ...

We had a problem saving a file with lots of columns (~1000) and
we couldn't store feature types as metadata since it exceeded the
8kb limit.

As a solution, now the columns are the 1st row in the csv file
and the metadata is in a separate file encoded in JSON format.

The unit test was modified to use a large dataset (1000 cols, 1M rows)

* Removes compression, since it does not work yet in pandas

See: pandas-dev/pandas#22555

* Fix list_datasets issue

Two files are created for each dataset.

* Decreases mock_dataset size

Causing 502 bad gateway on play.min.io

* Improves code when reading saved metrics

davidkh1255 commented May 20, 2020

I modified some code which seems to produce the bz2 file for me and appears to avoid the intermediate on-disk CSV file. Does this help you guys? I'm not sure how efficient the memory allocation is or how in-memory it all is, so if there are optimizations, please advise! I'm a Python newbie (first day), and this stuff seemed to be poorly documented, so thanks for making this thread, and apologies for my ignorance.

# This code takes a pandas df and makes a clickable link in your ipynb UI to download a bz2-compressed file
# Updated 2020-05-19 davidkh1255

# Housekeeping - BEGIN
import pandas as pd
import bz2
import base64
from IPython.display import HTML
# Housekeeping - END

# Create a test pandas dataframe from the example in #22555, and add a D col and data
mypd = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12], "D": [90, 91, 92, 93]})

# Note: this requires a pandas df as input
def create_download_link(df, title="Download bz2 file", filename="data.bz2"):
    csv = df.to_csv(index=False)  # turn off index numbers
    bz = bz2.BZ2Compressor()  # set up the BZ2Compressor
    # b64 = base64.b64encode(csv.encode())  # to not use bz2, uncomment this and comment out the next line
    b64 = base64.b64encode(bz.compress(csv.encode()) + bz.flush())  # do not forget to flush, otherwise the output will be a zero-byte file
    payload = b64.decode()
    html = '<a download="{filename}" href="data:application/octet-stream;base64,{payload}">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)
# END def create_download_link

# Call the function with your pandas df
create_download_link(mypd)
