diff options
author | Alex Ford <fordas@uw.edu> | 2013-01-30 19:44:14 -0800 |
---|---|---|
committer | Bartosz Telenczuk <muchatel@poczta.fm> | 2013-06-12 13:34:28 +0200 |
commit | b69c48d34d6b6d9be01f37bd5117e946e2556df8 (patch) | |
tree | c4f36abae1fe14dfea7e3166dcc0ace4ecfe9215 /numpy/lib/format.py | |
parent | cfae0143b436c3296eebe71e2dd730625dcaae95 (diff) | |
download | numpy-b69c48d34d6b6d9be01f37bd5117e946e2556df8.tar.gz |
Chunk reads in format.read_array.
Maximum data size limitations in the crc32 checksum handling (used by
the gzip module) cause errors when reading more than 2 ** 32 bytes from
gzip streams. Work around this issue when reading large arrays from npz
files by chunking reads to 256 MB.
This appears to resolve bug #2922.
Diffstat (limited to 'numpy/lib/format.py')
-rw-r--r-- | numpy/lib/format.py | 17 |
1 file changed, 14 insertions, 3 deletions
diff --git a/numpy/lib/format.py b/numpy/lib/format.py index 81e8cd010..de84d2820 100644 --- a/numpy/lib/format.py +++ b/numpy/lib/format.py @@ -457,9 +457,20 @@ def read_array(fp): else: # This is not a real file. We have to read it the memory-intensive # way. - # XXX: we can probably chunk this to avoid the memory hit. - data = fp.read(int(count * dtype.itemsize)) - array = numpy.fromstring(data, dtype=dtype, count=count) + # crc32 module fails on reads greater than 2 ** 32 bytes, breaking large reads from gzip streams + # Chunk reads to 256mb to avoid issue and reduce memory overhead of the read. + # In non-chunked case count < max_read_count, so only one read is performed. + + max_buffer_size = 2 ** 28 + max_read_count = max_buffer_size / dtype.itemsize + + array = numpy.empty(count, dtype=dtype) + + for i in xrange(0, count, max_read_count): + read_count = max_read_count if i + max_read_count < count else count - i + + data = fp.read(int(read_count * dtype.itemsize)) + array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype, count=read_count) if fortran_order: array.shape = shape[::-1] |