Different HDFStores in multiple threads crashes Python #2397

Closed
pag opened this issue Nov 30, 2012 · 6 comments
Labels
Multithreading Parallelism in pandas
pag commented Nov 30, 2012

import threading
import pandas as pd
import time

def foo():
    store = pd.HDFStore('my_hdf_file.h5')
    store['foo']
    store.close()


def main():
    threading.Thread(target=foo).start()
    threading.Thread(target=foo).start()
    time.sleep(2)

if __name__ == '__main__':
    main()

This crashes for me (Windows 7 using pytables 2.4.0 and pandas 0.9.1 from http://www.lfd.uci.edu/~gohlke/pythonlibs/). I can't get the stack trace easily, but I can try harder if necessary. Simply using tables.openFile and reading a few values seems to work fine.

Contributor

jreback commented Nov 30, 2012

The underlying storage mechanism, PyTables, is inherently not thread-safe for writes. HDFStore opens the store file with mode 'a' (append) by default, so this is trying to open two writers. Try opening in read mode:

store = pd.HDFStore('my_hdf_file.h5', mode='r')

I will add a note to the docs, as this is also a problem with multiprocessing (concurrent reads are OK, but writing and reading at the same time is a problem).

http://pl.digipedia.org/usenet/thread/16072/93/

Author

pag commented Nov 30, 2012

Sorry, my original example was a reader. The crash still happens with mode='r'.

Contributor

jreback commented Nov 30, 2012

I tried your example (after creating the h5 file) with mode='r' and got:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/code/arb/test/pytables-threading.py", line 8, in foo
    store.close()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 263, in close
    self.handle.close()
  File "/usr/local/lib/python2.7/site-packages/tables/file.py", line 2162, in close
    del _open_files[filename]
KeyError: 'my_hdf_file.h5'

So the PyTables layer is trying to close a file that is no longer in its registry of open files (the KeyError comes from del _open_files[filename]). This is a bug in PyTables; see the issue linked below. I guess it's not thread-safe even for reads.

PyTables/PyTables#130

Using with doesn't help either:

from pandas.io.pytables import get_store
def foo():
    with get_store('my_hdf_file.h5', mode = 'r') as store:
        store['foo']
        store.close()

I would just say avoid opening or using the file from multiple threads. I have found no issues reading in read-only mode from multiple processes, however.
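
If threads are needed anyway, one way to follow this advice is to let a single thread own the store and have the other threads send it requests. The sketch below assumes Python 3 and is not part of pandas: `StoreWorker` and its `open_store` argument are hypothetical names, and `open_store` is assumed to be a zero-argument callable returning an object with `__getitem__` and `close()`, for example `lambda: pd.HDFStore('my_hdf_file.h5', mode='r')`.

```python
import queue
import threading
from concurrent.futures import Future


class StoreWorker(threading.Thread):
    """Owns the store: the open, every read, and the close run on this thread."""

    def __init__(self, open_store):
        super().__init__(daemon=True)
        self._open_store = open_store  # hypothetical zero-argument factory
        self._requests = queue.Queue()

    def run(self):
        store = self._open_store()
        try:
            while True:
                item = self._requests.get()
                if item is None:  # sentinel posted by stop()
                    break
                key, future = item
                try:
                    future.set_result(store[key])
                except Exception as exc:
                    # Hand the error back to the calling thread.
                    future.set_exception(exc)
        finally:
            store.close()

    def read(self, key, timeout=None):
        """Called from any thread; blocks until the worker has read `key`."""
        future = Future()
        self._requests.put((key, future))
        return future.result(timeout=timeout)

    def stop(self):
        """Ask the worker to close the store and wait for it to finish."""
        self._requests.put(None)
        self.join()
```

A caller would start the worker once, call `worker.read('foo')` from any thread, and call `worker.stop()` at shutdown. Passing a `timeout` to `read` guards against waiting forever if the worker failed to open the store.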

Contributor

jreback commented Dec 1, 2012

The following example works correctly. I think that if you open and close the store in the main thread, you can read concurrently in other threads without a problem. (Still avoid reading and writing in more than one thread, however.)

import threading
import pandas as pd
import time

class Thread(threading.Thread):

    def __init__(self, store):
        threading.Thread.__init__(self)
        self.store = store

    def run(self):
        print(self.store['foo'])  # parenthesized form runs on Python 2 and 3

def main():
    store = pd.HDFStore('my_hdf_file.h5')
    t1 = Thread(store=store)
    t2 = Thread(store=store)
    t1.start()
    t2.start()
    time.sleep(2)
    t1.join()
    t2.join()
    store.close()

if __name__ == '__main__':
    store = pd.HDFStore('my_hdf_file.h5')
    store['foo'] = pd.Series(range(10))
    store.close()
    main()

Contributor

jreback commented Dec 6, 2012

Member

wesm commented Dec 6, 2012

We could potentially add locks to HDFStore at some point to prevent multiple threads from accessing the file at once.
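
Until something like that exists in HDFStore itself, a user-side approximation might look like the following. This is a minimal sketch, not the pandas implementation: `LockedStore` is a hypothetical wrapper, it only assumes the wrapped object supports `store[key]`, `store[key] = value` and `close()`, and whether a single lock is enough to avoid the PyTables problems discussed in this thread has not been verified here.

```python
import threading


class LockedStore:
    """Hypothetical wrapper that serializes every access to a store-like object."""

    def __init__(self, store):
        self._store = store
        # RLock so a thread that already holds the lock can re-enter safely.
        self._lock = threading.RLock()

    def __getitem__(self, key):
        with self._lock:
            return self._store[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._store[key] = value

    def close(self):
        with self._lock:
            self._store.close()
```

The intended usage would be to open the store once in the main thread, wrap it, for example `LockedStore(pd.HDFStore('my_hdf_file.h5', mode='r'))`, and pass the wrapper to the worker threads so that only one of them touches the file at a time.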
