
Iterating through TableIterator with where clause can incorrectly ignore data #8014

Closed
bboerner opened this issue Aug 13, 2014 · 12 comments · Fixed by #8029
Labels
Bug IO HDF5 read_hdf, HDFStore

@bboerner

Expected behaviour: for an appendable table stored using HDFStore, the summed length of the DataFrames returned by an iterator with a where clause should equal the length of the DataFrame returned using the same where clause with iterator=False (e.g. via TableIterator.get_values()).

The attached code generates appendable tables of size 100064, 200064, ..., 400064. It uses a where clause that is a superset of all possible values, and selects DataFrames with iterator=False, with and without the where clause, and with iterator=True, also with and without the where clause. In all cases except iterator=True with the where clause, the length of the returned DataFrames is correct.

For the failure cases, closer inspection in IPython shows it is the last 64 rows that are not being returned.

Note: in create_file() the appending of DataFrames with lengths 58689 and 41375 was chosen specifically to reproduce the problem. I originally encountered the problem with a dataset of length 174000064 whose last append had size 41375. I attempted to reproduce the problem by creating tables of various lengths in chunks of 100000 with a final append of 64, and wasn't able to do so.

Creating a table whose last chunk has 41375 rows and whose total length exceeds 300000 does, in my tests, reproduce the problem.

Output:

iteration: 0 PASSED
expected: 100064, df len: 100064, it (no where clause) len: 100064, it len: 100064
iteration: 1 PASSED
expected: 200064, df len: 200064, it (no where clause) len: 200064, it len: 200064
iteration: 2 FAILED
expected: 300064, df len: 300064, it (no where clause) len: 300064, it len: 300000
iteration: 3 FAILED
expected: 400064, df len: 400064, it (no where clause) len: 400064, it len: 400000

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 3.3.3
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None

import os
from dateutil.relativedelta import relativedelta

import numpy as np
randn = np.random.randn
from pandas import DataFrame, HDFStore, date_range

def create_df(beg_dt, periods=100000):
    """ Create a DataFrame of random data on a per-second DatetimeIndex. """

    dr = date_range(beg_dt, periods=periods, freq='S')
    df = DataFrame(index=dr, data=randn(periods, 4),
                   columns=['bid_price', 'bid_vol', 'ask_price', 'ask_vol'])
    return df

def create_file(iterations=1):
    beg_dt = '2014-08-12 13:30:00.000000'
    periods = 100000
    for i in xrange(iterations):
        df = create_df(beg_dt, periods)
        store_append(store, df, key="df")
        beg_dt = df.index[-1] + relativedelta(seconds=1)

    df = create_df(beg_dt, 58689)
    store_append(store, df, key="df")
    beg_dt = df.index[-1] + relativedelta(seconds=1)
    df = create_df(beg_dt, 41375)
    store_append(store, df, key="df")

    return df

def store_open(fname):
    return HDFStore(fname)

def store_get(store, key="df", where=None, start=None, stop=None,
              iterator=False, chunksize=None):
    df = None
    try:
        df = store.select(key, where=where, start=start, stop=stop,
                          iterator=iterator, chunksize=chunksize)
    except (KeyError, TypeError):
        pass

    return df

def store_append(store, df, key="df"):
    store.append(key, df, format='table')

path = '.'
fname = '/'.join([path, 'delme_test.h5'])

store = None
for n in xrange(0, 4):

    if store:
        try: store.close()
        except: pass
    try: os.unlink(fname)
    except: pass
    store = store_open(fname)

    create_file(n)
    store.close()

    store = store_open(fname)
    where = None
    df = store_get(store, 'df', where=where, iterator=False)

    expected_ln = len(df)

    beg_dt = '2014-08-12 13:30:00.000000'
    end_dt = '2032-12-31 13:30:00.000000'
    where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)

    # where clause, iterator=False
    df = store_get(store, 'df', where=where, iterator=False)
    ln_df = len(df)

    # no where clause
    it = store_get(store, 'df', where=None, iterator=True)
    dfs = [df for df in it if not df.empty]
    ln_it_no_where_clause = sum([len(df) for df in dfs])

    # where clause, iterator=True
    it = store_get(store, 'df', where=where, iterator=True)
    dfs = [df for df in it if not df.empty]
    ln_it = sum([len(df) for df in dfs])

    if expected_ln == ln_df and expected_ln == ln_it:
        print("iteration: %d PASSED" % n)
    else:
        print("iteration: %d FAILED" % n)
    print("expected: %d, df len: %d, it (no where clause) len: %d, it len: %d" %
        (expected_ln, ln_df, ln_it_no_where_clause, ln_it))

store.close()
jreback (Contributor) commented Aug 13, 2014

You would need to narrow this down to a small reproducible example:

  • create a test frame
  • write to the store
  • read in chunks so the where should encompass all values
  • concat the chunks to form a composite frame
  • compare versus the original

jreback added the HDF5 label Aug 13, 2014
@bboerner (Author)

This is the reproducible example, which I did narrow down. As I noted, simply writing test frames of length chunksize * some multiple followed by a DataFrame of length 64 does not reproduce the problem.

I explained that the problem only occurs once 200000 + 58689 + 41375 rows are written.

Comparing the concatenated frame to the original will fail because the concatenated frame is missing the last 64 elements, as I noted.

jreback (Contributor) commented Aug 13, 2014

@bboerner

In order for this to be tested, it has to be constructed as a test.

Once you have a test that reliably fails, then you can debug.

Please see the format here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py

jreback modified the milestone: 0.15.1 Aug 13, 2014
jreback (Contributor) commented Aug 13, 2014

It looks like a bug, but it could be that your selection is simply incorrect. It is hard to tell without a reliable test.
Simply comparing the lengths is not enough. What if each chunk is off by one in your selection (I am not saying it is), and the lengths compare equal but the content is in fact wrong? Or vice versa.
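A toy illustration of this point, with plain Python lists standing in for DataFrame chunks and hypothetical chunk boundaries: two chunked selections can have equal total length while returning different rows, so only comparing the concatenated content catches the error.

```python
# Two hypothetical chunked selections of a 10-row table (rows 0..9).
# Chunking A returns the correct rows; chunking B is shifted off by one
# at each boundary, dropping row 0 and inventing a row 10.
chunks_a = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
chunks_b = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

len_a = sum(len(c) for c in chunks_a)
len_b = sum(len(c) for c in chunks_b)

# The summed lengths compare equal...
assert len_a == len_b == 10

# ...but the actual content differs, so a length-only check would
# pass while the selection is wrong.
flat_a = [row for c in chunks_a for row in c]
flat_b = [row for c in chunks_b for row in c]
assert flat_a == list(range(10))
assert flat_a != flat_b
```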

@bboerner (Author)

@jreback, understood, I'll work up a test. Thanks.

@bboerner (Author)

@jreback,

This can be added to https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py and reproduces the problem.

Regards.

    def test_select_iterator_8014(self):

        # single table
        chunksize = 10000
        with ensure_clean_store(self.path) as store:

            df = tm.makeTimeDataFrame(100064, 'S')
            _maybe_remove(store, 'df')
            i = 0; ln = 58689
            store.append('df', df[i:i+ln])
            i = i+ln; ln = 41375
            store.append('df', df[i:i+ln])
            expected = df

            beg_dt = expected.index[0]
            end_dt = expected.index[-1]

            #
            # w/o iterator
            #

            # select w/o iteration and no where clause works
            result = store.select('df')
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, single term, begin
            # of range, works
            where = "index >= '%s'" % beg_dt
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, single term, end
            # of range, works
            where = "index <= '%s'" % end_dt
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            # select w/o iterator and where clause, inclusive range,
            # works
            where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)
            result = store.select('df',where=where)
            tm.assert_frame_equal(expected, result)

            #
            # with iterator
            #

            # select w/iterator and no where clause works
            results = []
            for s in store.select('df',iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, single term, begin
            # of range, fails
            where = "index >= '%s'" % beg_dt
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, single term, end of
            # range, fails
            where = "index <= '%s'" % end_dt
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

            # select w/iterator and where clause, inclusive range, fails
            where = "index >= '%s' & index <= '%s'" % (beg_dt, end_dt)
            results = []
            for s in store.select('df',where=where,iterator=True, chunksize=chunksize):
                results.append(s)
            result = concat(results)
            tm.assert_frame_equal(expected, result)

jreback (Contributor) commented Aug 14, 2014

@bboerner ok, that was very subtle; fixed in #8029.

Essentially, a chunk could happen to select more than chunksize rows (which happens in this example when the date crosses a day boundary). This might actually be a bug in PyTables, not sure.

In any event, we can completely bypass this by simply selecting the coordinates with the where clause up front, then using the chunksize to iterate over them (and selecting the data for each chunk).

This also then conforms to the docs, where you will get chunksize rows in every chunk (possibly fewer in the last one) of the results.
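The coordinates-first strategy described above can be sketched in plain Python (an illustrative sketch, not the actual pandas implementation; `iter_chunks` and `coords` are hypothetical names, with `coords` standing in for the row numbers a where clause would match, e.g. via HDFStore.select_as_coordinates):

```python
def iter_chunks(coords, chunksize):
    """Yield successive chunksize-sized slices of pre-selected coordinates.

    Because the where clause was already applied once to produce `coords`,
    each yielded slice has exactly chunksize entries (possibly fewer in
    the last one), and no matching rows can be dropped at a boundary.
    """
    for start in range(0, len(coords), chunksize):
        yield coords[start:start + chunksize]

# 300064 matching rows with chunksize 100000: the trailing 64 rows form
# their own final chunk instead of being silently dropped, matching the
# failing case in this issue.
coords = list(range(300064))
chunks = list(iter_chunks(coords, 100000))
assert sum(len(c) for c in chunks) == 300064
assert [len(c) for c in chunks] == [100000, 100000, 100000, 64]
```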

@bboerner (Author)

@jreback, thanks, verified the fix. Regards.

bboerner pushed commits to bboerner/pandas that referenced this issue Aug 14, 2014
@bboerner (Author)

@jreback, testing the fix with my app, I'm seeing another problem. I'll debug further and look for a reproducible test case.

jreback (Contributor) commented Aug 14, 2014

OK, I haven't merged this yet, as I cleaned up the code (but it passes the same tests).

@bboerner (Author)

@jreback, I've updated the test case to check for a subset of the data - the result is a "hang" in the list comprehension (the iterator never exits). I committed this to bboerner@8e24e8a, or I can submit a pull request if you prefer (ignore the commit of test_pytables2.py; that was a subset I was using and didn't mean to commit). Let me know.

Regards.

@bboerner (Author)

Fix test case: bboerner@7f7e1b5
