
Iterating through TableIterator with where clause can incorrectly ignore data #8014

Closed
bboerner opened this issue Aug 13, 2014 · 12 comments · Fixed by #8029
Labels
Bug IO HDF5 read_hdf, HDFStore

@bboerner

Expected behaviour: for an appendable table stored using HDFStore, the summed length of the DataFrames returned by an iterator with a where clause should equal the length of the DataFrame returned using the same where clause with iterator=False (e.g. via TableIterator.get_values()).

The attached code generates appendable tables of size 100064, 200064, ..., 400064. It uses a where clause that is a superset of all possible values, and selects DataFrames with iterator=False, with and without the where clause, and with iterator=True, also with and without the where clause. In all cases except iterator=True with the where clause, the length of the returned DataFrames is correct.

For the failure cases, closer inspection in IPython shows it is the last 64 rows that are not being returned.

Note: in create_file() the appending of DataFrames with lengths 58689 and 41375 was chosen specifically to reproduce the problem. I originally encountered the problem with a dataset of length 174000064 whose last append had size 41375. I attempted to reproduce the problem by creating tables of various lengths in chunks of 100000 with a final append of 64, and wasn't able to do so.

Creating a table whose last chunk has 41375 rows and whose total length exceeds 300000 does, in my tests, reproduce the problem.

Output:

iteration: 0 PASSED
expected: 100064, df len: 100064, it (no where clause) len: 100064, it len: 100064
iteration: 1 PASSED
expected: 200064, df len: 200064, it (no where clause) len: 200064, it len: 200064
iteration: 2 FAILED
expected: 300064, df len: 300064, it (no where clause) len: 300064, it len: 300000
iteration: 3 FAILED
expected: 400064, df len: 400064, it (no where clause) len: 400064, it len: 400000

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 3.3.3
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None

import os
from dateutil.relativedelta import relativedelta

import numpy as np
randn = np.random.randn
from pandas import DataFrame, HDFStore, date_range

def create_df(beg_dt, periods=100000):
    """ Create a DataFrame of random data on a per-second DatetimeIndex. """

    dr = date_range(beg_dt, periods=periods, freq='S')
    df = DataFrame(index=dr, data=randn(periods, 4),
                   columns=['bid_price', 'bid_vol', 'ask_price', 'ask_vol'])
    return df

def create_file(iterations=1):
    beg_dt = '2014-08-12 13:30:00.000000'
    periods = 100000
    for i in xrange(iterations):
        df = create_df(beg_dt, periods)
        store_append(store, df, key="df")
        beg_dt = df.index[-1] + relativedelta(seconds=1)

    df = create_df(beg_dt, 58689)
    store_append(store, df, key="df")
    beg_dt = df.index[-1] + relativedelta(seconds=1)
    df = create_df(beg_dt, 41375)
    store_append(store, df, key="df")

    return df

def store_open(fname):
    return HDFStore(fname)

def store_get(store, key="df", where=None, start=None, stop=None,
              iterator=False, chunksize=None):
    df = None
    try:
        df = store.select(key, where=where, start=start, stop=stop,
                          iterator=iterator, chunksize=chunksize)
    except (KeyError, TypeError):
        pass

    return df

def store_append(store, df, key="df"):
    store.append(key, df, format='table')

path = '.'
fname = '/'.join([path, 'delme_test.h5'])

store = None
for n in xrange(0, 4):

    if store:
        try: store.close()
        except: pass
    try: os.unlink(fname)
    except: pass
    store = store_open(fname)

    create_file(n)
    store.close()

    store = store_open(fname)
    where = None
    df = store_get(store, 'df', where=where, iterator=False)

    expected_ln = len(df)

    beg_dt = '2014-08-12 13:30:00.000000'
    end_dt = '2032-12-31 13:30:00.000000'
    where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)

    # where clause, iterator=False
    df = store_get(store, 'df', where=where, iterator=False)
    ln_df = len(df)

    # no where clause
    it = store_get(store, 'df', where=None, iterator=True)
    dfs = [df for df in it if not df.empty]
    ln_it_no_where_clause = sum([len(df) for df in dfs])

    # where clause, iterator=True
    it = store_get(store, 'df', where=where, iterator=True)
    dfs = [df for df in it if not df.empty]
    ln_it = sum([len(df) for df in dfs])

    if expected_ln == ln_df and expected_ln == ln_it:
        print("iteration: %d PASSED" % n)
    else:
        print("iteration: %d FAILED" % n)
    print("expected: %d, df len: %d, it (no where clause) len: %d, it len: %d" %
        (expected_ln, ln_df, ln_it_no_where_clause, ln_it))

store.close()
jreback (Contributor) commented Aug 13, 2014

You would need to narrow this down to a small reproducible example:

  • create a test frame
  • write to the store
  • read in chunks so the where should encompass all values
  • concat the chunks to form a composite frame
  • compare versus the original

jreback added the HDF5 label Aug 13, 2014
@bboerner (Author)

This is the reproducible example, which I did narrow down. As I noted, simply writing test frames of length chunksize * some multiple followed by a DataFrame of length 64 does not reproduce the problem.

I explained that the problem only occurs once 200000 + 58689 + 41375 rows are written.

Comparing the concatenated frame to the original will fail because the concatenated frame is missing the last 64 elements, as I noted.

jreback (Contributor) commented Aug 13, 2014

@bboerner

In order for this to be tested, it has to be constructed as a test.

Once you have a test that reliably fails, then you can debug.

Please see the format here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py

jreback modified the milestone: 0.15.1 Aug 13, 2014
jreback (Contributor) commented Aug 13, 2014

It looks like a bug, but it could be that your selection is simply incorrect. It is hard to tell without a reliable test.
Simply comparing the lengths is not enough. What if each chunk is off by one in your selection (I am not saying it is), and the lengths compare equal but the content is in fact wrong? Or vice versa.
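A toy illustration of this point, with plain Python lists standing in for DataFrame chunks and hypothetical chunk boundaries: two chunked selections can have equal total length while returning different rows, so only comparing the concatenated content catches the error.

```python
# Two hypothetical chunked selections of a 10-row table (rows 0..9).
# Chunking A returns the correct rows; chunking B is shifted off by one
# at each boundary, dropping row 0 and inventing a row 10.
chunks_a = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
chunks_b = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

len_a = sum(len(c) for c in chunks_a)
len_b = sum(len(c) for c in chunks_b)

# The summed lengths compare equal...
assert len_a == len_b == 10

# ...but the actual content differs, so a length-only check would
# pass while the selection is wrong.
flat_a = [row for c in chunks_a for row in c]
flat_b = [row for c in chunks_b for row in c]
assert flat_a == list(range(10))
assert flat_a != flat_b
```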

@bboerner (Author)

@jreback, understood, I'll work up a test. Thanks.

@bboerner (Author)

@jreback,

This can be added to https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py and reproduces the problem.

Regards.

    def test_select_iterator_8014(self):

        # single table
        chunksize = 10000
        with ensure_clean_store(self.path) as store:

            df = tm.makeTimeDataFrame(100064, 'S')
            _maybe_remove(store, 'df')
            i = 0; ln = 58689
            store.append('df', df[i:i+ln])
            i = i+ln; ln = 41375
            store.append('df', df[i:i+ln])
            expected = df

            beg_dt = expected.index[0]
            end_dt = expected.index[-1]

            #
            # w/o iterator
            #

            # select w/o iteration and no where clause works
            result = store.select('df')
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, single term, begin
            # of range, works
            where = "index >= '%s'" % beg_dt
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, single term, end
            # of range, works
            where = "index <= '%s'" % end_dt
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, inclusive range,
            # works
            where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            #
            # with iterator
            #

            # select w/iterator and no where clause works
            results = []
            for s in store.select('df',iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, single term, begin
            # of range, fails
            where = "index >= '%s'" % beg_dt
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, single term, end of
            # range, fails
            where = "index <= '%s'" % end_dt
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, inclusive range, fails
            where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

jreback (Contributor) commented Aug 14, 2014

@bboerner ok, that was very subtle; fixed in #8029.

Essentially, a chunk could happen to select more than chunksize rows (which happens in this example when the date crosses a day boundary). This might actually be a bug in PyTables, not sure.

In any event, we can completely bypass this by simply selecting the coordinates with the where clause up front, then using the chunksize to iterate over them (and selecting the data for each chunk).

This also then conforms to the docs, where you will get chunksize rows in every chunk (possibly fewer in the last one) of the results.
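The coordinates-first strategy described above can be sketched in plain Python (an illustrative sketch, not the actual pandas implementation; `iter_chunks` and `coords` are hypothetical names, with `coords` standing in for the row numbers a where clause would match, e.g. via HDFStore.select_as_coordinates):

```python
def iter_chunks(coords, chunksize):
    """Yield successive chunksize-sized slices of pre-selected coordinates.

    Because the where clause was already applied once to produce `coords`,
    each yielded slice has exactly chunksize entries (possibly fewer in
    the last one), and no matching rows can be dropped at a boundary.
    """
    for start in range(0, len(coords), chunksize):
        yield coords[start:start + chunksize]

# 300064 matching rows with chunksize 100000: the trailing 64 rows form
# their own final chunk instead of being silently dropped, matching the
# failing case in this issue.
coords = list(range(300064))
chunks = list(iter_chunks(coords, 100000))
assert sum(len(c) for c in chunks) == 300064
assert [len(c) for c in chunks] == [100000, 100000, 100000, 64]
```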

@bboerner (Author)

@jreback, thanks, verified the fix. Regards.

bboerner pushed commits to bboerner/pandas that referenced this issue Aug 14, 2014
@bboerner (Author)

@jreback, testing the fix with my app, I'm seeing another problem. I'll debug further and look for a reproducible test case.

jreback (Contributor) commented Aug 14, 2014

OK, I haven't merged this yet, as I cleaned up the code (but it passes the same tests).

@bboerner (Author)

@jreback, I've updated the test case to check for a subset of the data - the result is a "hang" in the list comprehension (the iterator never exits). I committed this to bboerner@8e24e8a, or I can submit a pull request if you prefer (ignore the commit of test_pytables2.py; that was a subset I was using and didn't mean to commit). Let me know.

Regards.

@bboerner (Author)

Fix test case: bboerner@7f7e1b5
