.. _how-to-io:

.. Setting up files temporarily to be used in the examples below. Clean-up
   has to be done at the end of the document.

.. testsetup::

    >>> from numpy.testing import temppath
    >>> with open("csv.txt", "wt") as f:
    ...     _ = f.write("1, 2, 3\n4,, 6\n7, 8, 9")
    >>> with open("fixedwidth.txt", "wt") as f:
    ...     _ = f.write("1   2      3\n44      6\n7   88889")
    >>> with open("nan.txt", "wt") as f:
    ...     _ = f.write("1 2 3\n44 x 6\n7 8888 9")
    >>> with open("skip.txt", "wt") as f:
    ...     _ = f.write("1 2 3\n44 6\n7 888 9")
    >>> with open("tabs.txt", "wt") as f:
    ...     _ = f.write("1\t2\t3\n44\t \t6\n7\t888\t9")

===========================
Reading and writing files
===========================

This page tackles common applications; for the full collection of I/O
routines, see :ref:`routines.io`.


Reading text and CSV_ files
===========================

.. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values

With no missing values
----------------------

Use :func:`numpy.loadtxt`.

With missing values
-------------------

Use :func:`numpy.genfromtxt`.

:func:`numpy.genfromtxt` will either

 - return a :ref:`masked array <maskedarray.generic>`
   **masking out missing values** (if ``usemask=True``), or

 - **fill in the missing value** with the value specified in ``filling_values``
   (default is ``np.nan`` for float, -1 for int).

With non-whitespace delimiters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> with open("csv.txt", "r") as f:
...     print(f.read())
1, 2, 3
4,, 6
7, 8, 9


Masked-array output
+++++++++++++++++++

>>> np.genfromtxt("csv.txt", delimiter=",", usemask=True)
masked_array(
  data=[[1.0, 2.0, 3.0],
        [4.0, --, 6.0],
        [7.0, 8.0, 9.0]],
  mask=[[False, False, False],
        [False,  True, False],
        [False, False, False]],
  fill_value=1e+20)

Array output
++++++++++++

>>> np.genfromtxt("csv.txt", delimiter=",")
array([[ 1.,  2.,  3.],
       [ 4., nan,  6.],
       [ 7.,  8.,  9.]])

Array output, specified fill-in value
+++++++++++++++++++++++++++++++++++++

>>> np.genfromtxt("csv.txt", delimiter=",", dtype=np.int8, filling_values=99)
array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7,  8,  9]], dtype=int8)

Whitespace-delimited
~~~~~~~~~~~~~~~~~~~~

:func:`numpy.genfromtxt` can also parse whitespace-delimited data files
that have missing values if

* **Each field has a fixed width**: Use the width as the `delimiter` argument. ::

     # File with width=4. The data does not have to be justified (for example,
     # the 2 in row 1), the last column can be less than width (for example, the 6
     # in row 2), and no delimiting character is required (for instance 8888 and 9
     # in row 3)

  >>> with open("fixedwidth.txt", "r") as f:
  ...     data = (f.read())
  >>> print(data)
  1   2      3
  44      6
  7   88889

  # Showing spaces as ^
  >>> print(data.replace(" ","^"))
  1^^^2^^^^^^3
  44^^^^^^6
  7^^^88889

  >>> np.genfromtxt("fixedwidth.txt", delimiter=4)
  array([[1.000e+00, 2.000e+00, 3.000e+00],
         [4.400e+01,       nan, 6.000e+00],
         [7.000e+00, 8.888e+03, 9.000e+00]])

* **A special value (e.g. "x") indicates a missing field**: Use it as the
  `missing_values` argument.

  >>> with open("nan.txt", "r") as f:
  ...     print(f.read())
  1 2 3
  44 x 6
  7 8888 9

  >>> np.genfromtxt("nan.txt", missing_values="x")
  array([[1.000e+00, 2.000e+00, 3.000e+00],
         [4.400e+01,       nan, 6.000e+00],
         [7.000e+00, 8.888e+03, 9.000e+00]])

* **You want to skip the rows with missing values**: Set `invalid_raise=False`.

  >>> with open("skip.txt", "r") as f:
  ...     print(f.read())
  1 2 3
  44 6
  7 888 9

  >>> np.genfromtxt("skip.txt", invalid_raise=False)  # doctest: +SKIP
  __main__:1: ConversionWarning: Some errors were detected !
      Line #2 (got 2 columns instead of 3)
  array([[  1.,   2.,   3.],
         [  7., 888.,   9.]])

* **The delimiter whitespace character is different from the whitespace that
  indicates missing data**. For instance, if columns are delimited by ``\t``,
  then missing data will be recognized if it consists of one or more spaces.

  >>> with open("tabs.txt", "r") as f:
  ...     data = (f.read())
  >>> print(data)
  1	2	3
  44	 	6
  7	888	9

  # Tabs vs. spaces
  >>> print(data.replace("\t","^"))
  1^2^3
  44^ ^6
  7^888^9

  >>> np.genfromtxt("tabs.txt", delimiter="\t", missing_values=" +")
  array([[  1.,   2.,   3.],
         [ 44.,  nan,   6.],
         [  7., 888.,   9.]])
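The masked-array and fill-value approaches above can also be combined: read
with ``usemask=True``, then choose the replacement value after parsing with
:meth:`~numpy.ma.MaskedArray.filled`. A minimal sketch, reusing ``csv.txt``
from the examples above (the fill value ``0.0`` is arbitrary)::

    import numpy as np

    # Parse the comma-delimited file, masking out the missing field ...
    masked = np.genfromtxt("csv.txt", delimiter=",", usemask=True)

    # ... then substitute a value chosen after inspecting the data;
    # filled() returns a plain ndarray with every masked entry replaced.
    arr = masked.filled(0.0)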
Read a file in .npy or .npz format
==================================

Choices:

 - Use :func:`numpy.load`. It can read files generated by any of
   :func:`numpy.save`, :func:`numpy.savez`, or :func:`numpy.savez_compressed`.

 - Use memory mapping. See `numpy.lib.format.open_memmap`.

Write to a file to be read back by NumPy
========================================

Binary
------

Use :func:`numpy.save`, or to store multiple arrays, :func:`numpy.savez`
or :func:`numpy.savez_compressed`.

For :ref:`security and portability <how-to-io-pickle-file>`, set
``allow_pickle=False`` unless the dtype contains Python objects, which
requires pickling.

Masked arrays :any:`can't currently be saved <MaskedArray.tofile>`,
nor can other arbitrary array subclasses.

Human-readable
--------------

:func:`numpy.save` and :func:`numpy.savez` create binary files. To **write a
human-readable file**, use :func:`numpy.savetxt`. The array can only be 1- or
2-dimensional, and there's no ``savetxtz`` for multiple files.

Large arrays
------------

See :ref:`how-to-io-large-arrays`.

Read an arbitrarily formatted binary file ("binary blob")
==========================================================

Use a :doc:`structured array <basics.rec>`.

**Example:**

The ``.wav`` file header is a 44-byte block preceding ``data_size`` bytes of the
actual sound data::

   chunk_id         "RIFF"
   chunk_size       4-byte unsigned little-endian integer
   format           "WAVE"
   fmt_id           "fmt "
   fmt_size         4-byte unsigned little-endian integer
   audio_fmt        2-byte unsigned little-endian integer
   num_channels     2-byte unsigned little-endian integer
   sample_rate      4-byte unsigned little-endian integer
   byte_rate        4-byte unsigned little-endian integer
   block_align      2-byte unsigned little-endian integer
   bits_per_sample  2-byte unsigned little-endian integer
   data_id          "data"
   data_size        4-byte unsigned little-endian integer

The ``.wav`` file header as a NumPy structured dtype::

    wav_header_dtype = np.dtype([
        ("chunk_id", (bytes, 4)),  # flexible-sized scalar type, item size 4
        ("chunk_size", "<u4"),     # little-endian unsigned 32-bit integer
        ("format", "S4"),          # 4-byte string, alternate spelling of (bytes, 4)
        ("fmt_id", "S4"),
        ("fmt_size", "<u4"),
        ("audio_fmt", "<u2"),      # 2-byte unsigned little-endian integer
        ("num_channels", "<u2"),
        ("sample_rate", "<u4"),
        ("byte_rate", "<u4"),
        ("block_align", "<u2"),
        ("bits_per_sample", "<u2"),
        ("data_id", "S4"),
        ("data_size", "<u4"),
        #
        # the sound data itself cannot be represented here:
        # it does not have a fixed size
    ])

    header = np.fromfile(f, dtype=wav_header_dtype, count=1)[0]

This ``.wav`` example is for illustration; to read a ``.wav`` file in real
life, use Python's built-in module :mod:`wave`.

(Adapted from Pauli Virtanen, :ref:`advanced_numpy`, licensed
under `CC BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_.)

.. _how-to-io-large-arrays:

Write or read large arrays
==========================

**Arrays too large to fit in memory** can be treated like ordinary in-memory
arrays using memory mapping.

- Raw array data written with :func:`numpy.ndarray.tofile` or
  :func:`numpy.ndarray.tobytes` can be read with :func:`numpy.memmap`
  (see the round-trip sketch at the end of this section)::

      array = np.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))

- Files output by :func:`numpy.save` (that is, using the numpy format) can be
  read using :func:`numpy.load` with the ``mmap_mode`` keyword argument::

      large_array[some_slice] = np.load("path/to/small_array", mmap_mode="r")

Memory mapping lacks features like data chunking and compression; more
full-featured formats and libraries usable with NumPy include:

* **HDF5**: `h5py <https://www.h5py.org/>`_ or `PyTables <https://www.pytables.org/>`_.
* **Zarr**: `here <https://zarr.readthedocs.io/>`_.
* **NetCDF**: :class:`scipy.io.netcdf_file`.

For tradeoffs among memmap, Zarr, and HDF5, see
`pythonspeed.com <https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/>`_.
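The round-trip sketch promised above: :func:`numpy.ndarray.tofile` writes raw
bytes only, so the shape, dtype, and endianness must be known to (and supplied
by) the reader. The file name ``myarray.arr`` is arbitrary::

    import numpy as np

    # Build a sample int16 array (the modulo keeps values in int16 range)
    # and dump its raw bytes; shape and dtype are *not* recorded in the file.
    a = (np.arange(1024 * 1024) % 1000).astype(np.int16).reshape(1024, 1024)
    a.tofile("myarray.arr")

    # The reader must supply dtype and shape itself. mode="r" maps the file
    # read-only, so pages are loaded lazily instead of all at once.
    b = np.memmap("myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))
    assert (b == a).all()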
Write files for reading by other (non-NumPy) tools
==================================================

Formats for **exchanging data** with other tools include HDF5, Zarr, and
NetCDF (see :ref:`how-to-io-large-arrays`).

Write or read a JSON file
=========================

NumPy arrays are **not** directly JSON serializable. One workaround is to
convert an array to a nested Python list, which is serializable, with
:meth:`~numpy.ndarray.tolist`, and to rebuild it with :func:`numpy.asarray`.

.. _how-to-io-pickle-file:

Save/restore using a pickle file
================================

Avoid when possible; :doc:`pickles <python:library/pickle>` are not secure
against erroneous or maliciously constructed data. Use :func:`numpy.save` and
:func:`numpy.load`. Set ``allow_pickle=False``, unless the array dtype includes
Python objects, in which case pickling is required.

Convert from a pandas DataFrame to a NumPy array
================================================

See :meth:`pandas.DataFrame.to_numpy`.

Save/restore using `~numpy.ndarray.tofile` and `~numpy.fromfile`
================================================================

In general, prefer :func:`numpy.save` and :func:`numpy.load`.

:func:`numpy.ndarray.tofile` and :func:`numpy.fromfile` lose information on
endianness and precision and so are unsuitable for anything but scratch
storage.

.. testcleanup::

    >>> import os
    >>> # Remove the files created in testsetup. If needed, there are
    >>> # conveniences in e.g. astroquery to do this more automatically.
    >>> for filename in ['csv.txt', 'fixedwidth.txt', 'nan.txt', 'skip.txt', 'tabs.txt']:
    ...     os.remove(filename)