Installation

These instructions are sufficient to build and install PlasmaStMan, making it readily available for usage by third-party applications.

While these instructions are enough to get started, take extra care if you plan to use this storage manager from a Python environment. For details see Python quirks.

Dependencies

This project depends on:

  • Any C++14 compiler

  • casacore > 3.3.0 (with 64-bit table support)

  • arrow >= 1.0.0-SNAPSHOT with plasma support

Compiling

This is a cmake-based project, so it can be built as any standard cmake project:

$ git clone https://gitlab.com/ska-telescope/ska-sdp-plasmastman
$ cd ska-sdp-plasmastman
$ cmake . -B build
$ cmake --build build

Some of the most relevant cmake variables (passed on the first cmake invocation via -Dvariable=value) used for compiling are:

  • CASACORE_ROOT_DIR: Root directory of a non-default casacore installation, if one is used.

  • Arrow_DIR: directory containing the cmake configuration exported by Apache Arrow (usually under lib/cmake/arrow in the arrow installation area).

  • Plasma_DIR: directory containing the cmake configuration exported by Apache Plasma (usually under lib/cmake/arrow in the arrow installation area).

  • CMAKE_CXX_COMPILER: The C++ compiler to use.

  • CMAKE_CXX_FLAGS: Extra C++ compilation flags.

  • CMAKE_BUILD_TYPE: The type of build to produce, one of Debug, Release and RelWithDebInfo.

  • BUILD_TESTING: Whether to build unit tests or not, defaults to ON.
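
For scripted builds, the configure step can be assembled programmatically. The following is a minimal sketch, not part of PlasmaStMan: the helper name cmake_configure_command and the /opt/casacore path are hypothetical, and only the -Dvariable=value convention described above is taken from these instructions.

```python
import shlex


def cmake_configure_command(variables, source_dir=".", build_dir="build"):
    """Build the argument list for the first cmake invocation.

    Each entry of `variables` is passed as -Dname=value.
    """
    command = ["cmake", source_dir, "-B", build_dir]
    command += [f"-D{name}={value}" for name, value in variables.items()]
    return command


command = cmake_configure_command({
    "CMAKE_BUILD_TYPE": "Release",
    "BUILD_TESTING": "OFF",
    "CASACORE_ROOT_DIR": "/opt/casacore",  # hypothetical installation path
})

# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) to run the configure step from a script.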

Testing

A set of unit tests is included and built by default. To execute them do:

$ cmake --build build --target test

The unit tests require the plasma-store-server executable (part of a standard C++ Arrow Plasma installation) to be visible in the path.

If you want further control on ctest’s command line flags you can do:

$ cmake --build build --target test -- ARGS="<ctest command line flags>"

or alternatively:

$ cd build/
$ ctest <ctest command line flags>

Python quirks

When using PlasmaStMan from Python, pay special attention to how the python-casacore and pyarrow packages are installed (if your code needs them). Otherwise you may run into errors that are difficult to debug.

python-casacore

TL;DR:

  • Don’t install the pre-built binary wheels from PyPI.

  • If you can, use the kernsuite repositories to install the casacore libraries and python-casacore python package from pre-built apt packages.

  • If installing from kernsuite is not an option, then ensure python-casacore is built against the same casacore installation PlasmaStMan was built against.

Starting from version 3.4.0, the python-casacore package offers pre-built binary wheels for some major OS and Python version combinations. These binary wheels come bundled with a copy of the underlying casacore libraries (libcasa_casa.so, libcasa_tables.so, etc) and their dependencies. Each of these bundled libraries has its own specific SONAME and matching file name (e.g. libcasa_tables-734048a7.so.6), thus avoiding interference with any system-wide installation.

On the other hand, the plug-in mechanism used to register third-party storage managers with casacore involves first loading the storage manager shared library into memory, then invoking a registration function in the library that registers itself into a static casacore-owned registration map, and finally checking that the registration was successful. This usually looks like this:

+-------------+      1. dlopen()     +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+                      +----------------+
 ^  |  ^  |                               ^  |
 |  |  |  |   2. register_plasmastman()   |  |
 |  |  |  \-------------------------------/  |
 |  |  |                                     |
 |  |  |     3. DataMan::registerCtor()      |
 |  |  \-------------------------------------/
 |  |
 \--/  4. check_registration() // all good :)

However, when using the binary wheels from PyPI, the SONAME of the bundled libraries differs from that of the libraries used to compile the storage manager. As a result, two different copies of casacore.so are loaded into memory, and the interaction looks like this:

+----------------------+   1. dlopen()     +----------------+  1.1 dlopen()  +-------------+
| casacore-734048a7.so | ----------------> | plasmastman.so | -------------> | casacore.so |
+----------------------+                   +----------------+                +-------------+
          ^  |     |                           ^  |                                 ^
          |  |     | 2. register_plasmastman() |  |                                 |
          |  |     \---------------------------/  |                                 |
          |  |                                    |                                 |
          |  |                                    |    3. DataMan::registerCtor()   |
          |  |                                    \---------------------------------/
          |  |
          \--/  4. check_registration() // fails, registration cannot be found :(

In particular, the error message will look something like:

RuntimeError: Table DataManager error: Data Manager class PlasmaData is not registered
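
The difference between the two diagrams can be mimicked with a toy model. This is only an analogy written in Python, not casacore code: the class CasacoreCopy and the function load_storage_manager are made up for illustration, and each instance stands in for one copy of the casacore libraries with its own static registration map.

```python
class CasacoreCopy:
    """Stands in for one loaded copy of the casacore libraries (toy model)."""

    def __init__(self, name):
        self.name = name
        self.registry = {}  # plays the role of the static registration map

    def register_ctor(self, data_manager_type, ctor):
        self.registry[data_manager_type] = ctor

    def is_registered(self, data_manager_type):
        return data_manager_type in self.registry


def load_storage_manager(host, linked_against):
    """Mimic steps 1 to 4 of the diagrams.

    The plug-in registers itself with the copy of casacore it was linked
    against, then the host looks the data manager up in its own registry.
    """
    linked_against.register_ctor("PlasmaData", object)
    return host.is_registered("PlasmaData")


system_casacore = CasacoreCopy("casacore.so")
bundled_casacore = CasacoreCopy("casacore-734048a7.so")

# Host and plug-in share one copy: the registration is found.
same_copy_ok = load_storage_manager(system_casacore, system_casacore)

# The wheel uses its bundled copy while the plug-in uses the system copy:
# the registration lands in a registry the host never looks at.
two_copies_ok = load_storage_manager(bundled_casacore, system_casacore)

print(same_copy_ok, two_copies_ok)
```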

This situation is specific to the binary wheels distributed via PyPI. To avoid this issue one must ensure that the python-casacore package uses the same libraries the storage manager was compiled against. This can be done either by installing python-casacore from source and pointing it to an existing casacore installation (which itself might be installed from source or not), or by using pre-compiled packages that don’t incur this duplication of libraries, like the apt packages provided by the Kernsuite project.
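
To check whether a running process has ended up with two copies of a casacore library, one option on Linux is to inspect /proc/self/maps. The sketch below is a diagnostic aid under stated assumptions, not part of PlasmaStMan or casacore: the function names are hypothetical, and the grouping rule assumes bundled copies differ from system ones only by a hash suffix such as the -734048a7 shown above.

```python
import os
from collections import defaultdict


def mapped_libraries(maps_text):
    """Return the set of shared-object paths named in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) == 6 and ".so" in os.path.basename(fields[5]):
            paths.add(fields[5].strip())
    return paths


def copies_by_library(paths, prefix="libcasa_"):
    """Group mapped files by library name and keep those mapped more than once.

    "libcasa_tables-734048a7.so.6" and "libcasa_tables.so.6" both map to
    the key "libcasa_tables".
    """
    groups = defaultdict(set)
    for path in paths:
        base = os.path.basename(path)
        if base.startswith(prefix):
            stem = base.split(".so")[0].split("-")[0]
            groups[stem].add(path)
    return {stem: sorted(found) for stem, found in groups.items() if len(found) > 1}


def duplicate_casacore_libraries():
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as maps:
        return copies_by_library(mapped_libraries(maps.read()))
```

Calling duplicate_casacore_libraries() after the failing table operation should return an empty dictionary in a healthy setup. Any entry it reports lists the paths of the competing copies.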

pyarrow

TL;DR:

  • Pre-built binary wheels from PyPI are incompatible with pre-built Arrow apt packages provided by Apache.

  • You can install a different version of pyarrow alongside the pre-built Arrow apt packages, but this might break in the future.

  • You can install pyarrow from sources, building them against the same Arrow/Plasma installation PlasmaStMan was built against.

Apache Arrow publishes binary wheels on PyPI so that users can install the pyarrow Python package without needing a compiler or any other external libraries. As with python-casacore, these binary wheels are bundled with their own copy of the Arrow shared libraries (libarrow.so, libplasma.so and so on). For a given version of Arrow, these libraries share the same SONAME with those installed via the Arrow apt repositories. However, the PyPI pyarrow binary wheels are compiled using a version of gcc that predates the introduction of the dual ABI mechanism (see the libstdc++ Dual ABI documentation for a more detailed explanation). As a result, the Arrow libraries generated by newer versions of gcc define differently named symbols than those generated by older versions of gcc, and therefore they cannot be mixed freely (e.g., linked or dynamically loaded). This problem has been reported, but other than acknowledging the issue and providing some suggestions on how to proceed, the final response was that this use case is not officially supported by the Arrow published artifacts.

Because of this situation, problems occur if the python process loads the storage manager, which has been compiled against the apt-installed Arrow libraries, after importing the PyPI-installed pyarrow. In such cases the following situation occurs:

+-------------+
|             |      1.1 no dlopen(), library with same SONAME already loaded
|             |      1.2 check_required_symbols() // fails, symbol not found
| libarrow.so | <--------------------------\
|             |                            |
+-------------+      1. dlopen()     +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+                      +----------------+

In particular, the error message will look something like:

RuntimeError: Shared library plasmastman not found in CASACORE_LDPATH or (DY)LD_LIBRARY_PATH
libcasa_plasmastman.so.4: cannot open shared object file: No such file or directory
libcasa_plasmastman.so: cannot open shared object file: No such file or directory
libplasmastman.so.4: cannot open shared object file: No such file or directory
/usr/local/lib/libplasmastman.so: undefined symbol: _ZN5arrow5fieldENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_8DataTypeEEbS6_IKNS_16KeyValueMetadataEE
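
The undefined symbol in the last line is the telltale sign of an ABI mismatch: its mangled name contains std::__cxx11, the inline namespace libstdc++ uses for new-ABI string types. The following sketch, with hypothetical helper names, extracts the symbol from such an error message and checks for that marker. It is a heuristic for reading error messages, not a general demangler.

```python
import re


def undefined_symbol(error_text):
    """Return the symbol named in an 'undefined symbol:' message, or None."""
    match = re.search(r"undefined symbol: (\S+)", error_text)
    return match.group(1) if match else None


def uses_cxx11_abi(mangled_symbol):
    """True if the mangled name refers to libstdc++ new-ABI types.

    New-ABI names carry either the std::__cxx11 inline namespace or the
    [abi:cxx11] tag, which is mangled as "B5cxx11".
    """
    return "__cxx11" in mangled_symbol or "B5cxx11" in mangled_symbol


message = (
    "/usr/local/lib/libplasmastman.so: undefined symbol: "
    "_ZN5arrow5fieldENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
    "St10shared_ptrINS_8DataTypeEEbS6_IKNS_16KeyValueMetadataEE"
)
symbol = undefined_symbol(message)

# The storage manager asks for a new-ABI symbol that the already loaded,
# old-ABI libarrow.so does not provide.
print(uses_cxx11_abi(symbol))
```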

Note that if pyarrow has not yet been imported at the time the storage manager library is loaded then no error occurs:

+-------------+  1. dlopen()  +----------------+  1.1 dlopen()  +-------------+
| casacore.so | ------------> | plasmastman.so | -------------> | libarrow.so |
+-------------+               +----------------+                +-------------+
                                     |                                 ^
                                     |                                 |
                                     \---------------------------------/
                                          1.2. check_required_symbols() // fine

The situation above is a bit brittle as it depends on pyarrow not being loaded at the time. Moreover, loading it later might also lead to the same missing symbol error.
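
If you have to live with this setup, you can at least make the import-order dependency explicit and fail early with a clearer message. The guard below is a sketch with a hypothetical function name. It only makes sense when pyarrow comes from a PyPI wheel and the storage manager was built against apt-installed Arrow libraries, and it does not protect against importing pyarrow afterwards.

```python
import sys


def ensure_pyarrow_not_imported(modules=None):
    """Raise if pyarrow is already imported in this interpreter.

    `modules` defaults to sys.modules; it is a parameter only so that the
    check can be exercised without importing pyarrow.
    """
    modules = sys.modules if modules is None else modules
    if "pyarrow" in modules:
        raise RuntimeError(
            "pyarrow was imported before the storage manager was loaded; "
            "its bundled libarrow.so may shadow the one plasmastman needs"
        )
```

Call ensure_pyarrow_not_imported() immediately before the first table operation that triggers loading of the storage manager.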

One possibility, albeit a fragile one, is to install from PyPI a version of pyarrow different from the Arrow version installed via apt, so that the SONAMEs of the two libraries don’t collide. That way, plasmastman.so is forced to load a different copy of the Arrow library into memory. This results in the following:

+---------------+
|               |
|               |
| libarrow.3.so |
|               |
+---------------+  1. dlopen()  +----------------+  1.1 dlopen()  +---------------+
| casacore.so   | ------------> | plasmastman.so | -------------> | libarrow.4.so |
+---------------+               +----------------+                +---------------+
                                       |                                 ^
                                       |                                 |
                                       \---------------------------------/
                                            2. check_required_symbols() // all good :)

This obviously results in two copies of different versions of the Arrow library loaded into memory. Although we haven’t noticed any side-effects, this might not always be the case.

The ultimate solution is of course to avoid the problem with bundled libraries altogether and install pyarrow from source, compiling against the same installation of Arrow/Plasma that PlasmaStMan was compiled against. This results in a clean environment, but has a higher setup cost:

+-------------+
|             |      1.1 no dlopen(), library with same SONAME already loaded
|             |      1.2 check_required_symbols() // all good :)
| libarrow.so | <--------------------------\
|             |                            |
+-------------+      1. dlopen()     +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+                      +----------------+