In-memory to_csv compression #22555

Closed · ZaxR opened this issue Aug 31, 2018 · 11 comments · Fixed by #35129
Labels: Enhancement, IO CSV (read_csv, to_csv)

Comments

ZaxR commented Aug 31, 2018

Code Sample, a copy-pastable example if possible

# Attempt 1
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
# RuntimeWarning: compression has no effect when passing file-like object as input.
type(test)
# Out: str

# Attempt 2
from io import BytesIO

b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
# Out: TypeError: a bytes-like object is required, not 'str'

Problem description

I am trying to gzip-compress a DataFrame in memory (as opposed to writing it directly to a named file location). The use case is, I imagine, similar to the reason to_csv now allows omitting the path in other cases to create an in-memory representation; specifically, I need to save the compressed DataFrame to a cloud location using a custom URI, so I'm temporarily keeping it in memory for that purpose.

Expected Output

I would expect the compression option to result in a compressed bytes object (similar to what the gzip library produces).
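
Concretely, something like this sketch captures what I'm hoping for (hypothetical until the enhancement lands; on 0.23.4 it raises the TypeError shown above):

from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
buf = BytesIO()
df.to_csv(buf, compression="gzip")  # would write gzipped CSV bytes into buf
gz_bytes = buf.getvalue()           # compressed bytes, ready to upload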

Thank you in advance for your help!

Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought it might have fixed this as well, but it looks like it stopped just short of covering my case.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd added the "IO CSV (read_csv, to_csv)" label Aug 31, 2018
gfyoung (Member) commented Sep 1, 2018

Interesting proposal! As a (temporary) workaround, could you not save to disk and then read into memory by any chance?

BTW, if you have ideas on how to implement in-memory compression, go for it!
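
For example, a rough sketch of that temporary-file round trip (the tempfile usage here is just illustrative):

import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
    path = tmp.name
df.to_csv(path, compression="gzip")  # compression works for named paths
with open(path, "rb") as f:
    gz_bytes = f.read()  # the gzipped CSV is now in memory
os.remove(path)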

ZaxR (Author) commented Sep 6, 2018

Hey, thanks for the reply @gfyoung, and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (I can't be sure what the end user's local environment will look like).

My thought was something like this as a workaround:

import gzip
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
b_buf = BytesIO()
with gzip.open(b_buf, 'wb') as f:
    f.write(df.to_csv().encode())  # write the CSV text as gzipped bytes

gz_bytes = b_buf.getvalue()  # the compressed CSV, still in memory

gfyoung (Member) commented Sep 6, 2018

@ZaxR : Gotcha. Interesting...I think it would be a solid enhancement nonetheless. Would be open to proposals for implementation.

dhimmel (Contributor) commented Sep 21, 2018

I agree that the compression argument should take effect when a file-like object is passed. This enhancement would likely include implementing zip writing support in _get_handle, which would address the frustration I had in #22011 (comment).
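
As a point of reference, the stdlib side of in-memory zip writing is straightforward; the open question is wiring it into _get_handle (this sketch shows only the zipfile part):

import zipfile
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
buf = BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", df.to_csv(index=False))
zip_bytes = buf.getvalue()  # a complete zip archive, entirely in memory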

silverdrake11 commented Oct 13, 2018

For the 'gzip' compression, _get_handle() is not being called when BytesIO() is passed. This causes it to fail at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py.

If _get_handle() is called on BytesIO(), then what happens is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails because GzipFile opens the fileobj read-only by default. Setting the mode explicitly works: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode))), as in the sketch below.
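
A standalone sketch of that mode fix (assuming mode='wb'; the wrapper chain is csv.writer -> TextIOWrapper -> GzipFile -> BytesIO):

import csv
import gzip
from io import BytesIO, TextIOWrapper

b_buf = BytesIO()
with gzip.GzipFile(fileobj=b_buf, mode='wb') as gz:
    with TextIOWrapper(gz, encoding='utf-8', newline='') as text:
        writer = csv.writer(text)
        writer.writerow(['A', 'B'])
        writer.writerow([1, 2])
gz_bytes = b_buf.getvalue()  # gzipped CSV bytes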

The 'bz2' compression fix is the same. 'xz' will not compress BytesIO() unless LZMACompressor is used. 'zip' has the custom workflow referenced by dhimmel, which complicates it further.

There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

In order to actually call _get_handle() on BytesIO(), the elif hasattr(self.path_or_buf, 'write') in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.

For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is because CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None (which effectively means 'ascii').

This is the test code I was using:

from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
hello = BytesIO()
test = df.to_csv(hello, compression='gzip')
print(hello.getvalue())

dhimmel (Contributor) commented Oct 15, 2018

> There is too much logic in _get_handle, and it is called many times for reading and for writing.

I agree. Especially since there is no docstring to define what the function intends to support.

> One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

I agree it may be helpful to split read/write and 2/3. However, with 2019 only a couple of months away, the purging of Python 2 from pandas is just around the corner. It seems like any major changes should plan for a 3-only codebase (and hence benefit from the great simplification)?

gfyoung (Member) commented Oct 15, 2018

@dhimmel : I agree that the implementation should be 3-oriented. If it's 2-oriented as well, great! If not, I would still write it out but just hold off on the PR until the turn of the calendar year.

ZaxR (Author) commented Feb 8, 2019

@dhimmel and @gfyoung - Now that we've reached the new year/pandas release, any update on this enhancement? Would really love this capability :)

gfyoung (Member) commented Feb 8, 2019

@ZaxR: Thanks for the ping! We're still in the process of releasing versions that are Python-2-compatible, so we might want to hold off on this a little longer. That being said, proposals for a pure Python-3-compatible implementation would be great 👍

MarcSkovMadsen commented Feb 6, 2020

I was also looking for the compression functionality in order to produce base64-encoded download links.

import base64

import pandas as pd


def _to_base64_encoded_link(data: pd.DataFrame) -> str:
    csv = data.to_csv(index=False)
    # some str <-> bytes conversions are necessary here
    b64 = base64.b64encode(csv.encode()).decode()
    return f'<a href="data:file/csv;base64,{b64}" download="data.csv">Download</a>'

Currently my dataframes are too big, so I would like to compress them.
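
For instance, a sketch of the same link builder with gzip applied before the base64 step (gzip.compress is the stdlib one-shot API; the function name and MIME type here are just illustrative):

import base64
import gzip

import pandas as pd


def _to_gzipped_base64_link(data: pd.DataFrame) -> str:
    gz = gzip.compress(data.to_csv(index=False).encode())
    b64 = base64.b64encode(gz).decode()
    return f'<a href="data:application/gzip;base64,{b64}" download="data.csv.gz">Download</a>'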

fberanizo added a commit to platiagro/sdk that referenced this issue Apr 17, 2020
* Moves column from metadata to body

Max metadata size is 8kb and became a problem when the dataset has
lots of columns.

* Removes redundant parameter

encoding='utf-8' is the default value

* New metadata storage solution

MinIO/S3 object metadata has lots of problems:
max 8kb size, allows only Dict[str, str], ...

We had a problem saving a file with lots of columns (~1000) and
we couldn't store feature types as metadata since it exceeded the
8kb limit.

As a solution, now the columns are the 1st row in the csv file
and the metadata is in a separate file encoded in JSON format.

The unit test was modified to use a large dataset (1000 cols, 1M rows)

* Removes compression, since it does not work yet in pandas

See: pandas-dev/pandas#22555

* Fix list_datasets issue

Two files are created for each dataset.

* Decreases mock_dataset size

Causing 502 bad gateway on play.min.io

* Improves code when reading saved metrics

davidkh1255 commented May 20, 2020

I modified some code which seems to produce the bz2 file for me and appears to avoid the intermediate on-disk CSV file. Does this help you guys? I'm not sure how efficient the memory allocation is or how in-memory it all is, so if there are optimizations, please advise! I'm a Python newbie (first day), and this stuff seemed to be poorly documented, so thanks for making this thread, and apologies for my ignorance.

# This code takes a pandas df and makes a clickable link in your ipynb UI to download a bz2-compressed file
# Updated 2020-05-19 davidkh1255

# Housekeeping - BEGIN
import pandas as pd
import bz2
import base64
from IPython.display import HTML
# Housekeeping - END

# Create a test pandas dataframe from the example in #22555, and add a D col and data
mypd = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12], "D": [90, 91, 92, 93]})

# Note: this requires a pandas df as input
def create_download_link(df, title="Download bz2 file", filename="data.bz2"):
    csv = df.to_csv(index=False)  # turn off index numbers
    bz = bz2.BZ2Compressor()  # set up the BZ2Compressor
    # b64 = base64.b64encode(csv.encode())  # to not use bz2, uncomment this and comment out the next line
    b64 = base64.b64encode(bz.compress(csv.encode()) + bz.flush())  # do not forget to flush, otherwise the output will be a zero-byte file
    payload = b64.decode()
    html = '<a download="{filename}" href="data:application/octet-stream;base64,{payload}">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)
# END def create_download_link

# Call the function with your pandas df
create_download_link(mypd)
