Post 44
Pipeline Skeleton released
02-Feb-2011
Regarding Bob Carpenter's ready-to-distribute
pyhi package, a
bare-bones Python package with all the trimmings (modular structure,
configurability, build automation, etc.), and his apparent recent
skew towards C/C++,
I thought it would be interesting and useful to have a similar
package in C++. Such a framework could be based on
a sequential processing structure, whose modules could be defined
(and redefined) in an external XML config file, and whose core implementation
could be abstract with regard
to concrete application needs (declaring pure virtual functions),
thus defining a neat interface ready to be extended
for any particular purpose. So, I have just released the Pipeline Skeleton
(see the CODE section of my homepage).
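To give a flavour of the design (a minimal sketch with made-up names, not necessarily the exact classes of the released package), the core boils down to an abstract processor interface declaring a pure virtual processing function, which concrete applications then extend:

    #include <cctype>
    #include <string>

    // Abstract interface: the framework core only knows about this contract.
    class Processor {
    public:
        virtual ~Processor() {}
        // Pure virtual function: concrete applications must implement it.
        virtual std::string process(const std::string& data) = 0;
    };

    // Illustrative extension for a particular purpose.
    class UpperCaser : public Processor {
    public:
        virtual std::string process(const std::string& data) {
            std::string out(data);
            for (std::string::size_type i = 0; i < out.size(); ++i)
                out[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i])));
            return out;
        }
    };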
The Pipeline Skeleton is intended to provide an adequate ground framework to
buttress a data processing project (e.g., in spoken language processing)
without compromising its future growth.
The main motivation for coding it stems from
Bjarne Stroustrup's claim (for example)
that "modularity is a fundamental aspect of all successful large programs".
The point is to avoid throwing away (and redoing from scratch)
pieces of code produced a while ago because their
original design did not consider any extensibility and/or reusability
aspects. Maintaining such awful code eventually becomes a waste of time,
not to mention the headache it causes when many programmers work
on the same project code.
Every time a critical variable is hardcoded,
a piece of code is left undocumented, or the like, the future of a program is jeopardised,
and sooner or later its developers will have to face these bad coding practices, unless the
program in question dies prematurely and nobody ever needs to run it again.
Unfortunately, this is the acknowledged style of scientific code (see
this and
this)
and we have to deal with it.
Why C++? Because of performance issues, mainly. Java (to name a most comparable
and extensively used OOP language, raising an eternal question) does perform swiftly with
a JIT compiler under certain circumstances. But when a real-time response is
required or pursued, dealing with large/huge amounts of data (e.g., a
Wikipedia dump or a several-hours-long speech corpus),
Java does not yet seem to yield an effectiveness comparable to that of a
natively compiled language like C++. C++ has always been meant for
high-performance applications. After all, powerful virtual machines like
OpenJDK's HotSpot and
LLVM are written in C/C++.
Why a sequential processing structure? Because many speech and language
processing applications rely on some sort of pipeline architecture,
e.g., see the Sphinx-4 FrontEnd,
which inspired the modular processing framework of the
EmoLib Affective Tagger.
Nevertheless, there is a design (and thus also implementation)
difference between these examples and the Pipeline Skeleton.
The former leave the data flow control
to the processors (i.e., the modules), as these are arranged in a linked list.
The latter, instead, builds an array containing all the processors
and iterates over them to process the data. This decision is
motivated by code simplicity (and Occam's Razor), given that the
arrangement of processors is set in the XML config file and
maintained throughout the processing session (at least this is the modus
operandi that I have always followed to organise and conduct my experiments).
Since no insertion and removal of processors is typically
allowed after the pipeline initialisation step,
there is no apparent need to keep the linked-list structure (anyway,
the std::vector class also allows such operations).
There is, though, some overhead introduced by the iteration loop (in addition to the common
N dereferences and N function calls for N processors),
but Stroustrup (1999) demonstrates
that the reduction in code complexity can be obtained without loss of efficiency
using the standard library.
Finally, the "cyclic" class hierarchy in the FrontEnd where the pipeline
extends the processor and also contains processors is
reorganised into a tree-like hierarchy for conceptual clarity.
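As a rough sketch of this arrangement (again with illustrative names, not necessarily those of the package), the pipeline owns a std::vector of processors, filled once at initialisation from the XML config, and simply iterates over it; note that the pipeline contains processors but does not derive from Processor, hence the tree-like hierarchy:

    #include <string>
    #include <vector>

    // Abstract interface as sketched above.
    class Processor {
    public:
        virtual ~Processor() {}
        virtual std::string process(const std::string& data) = 0;
    };

    // Contains processors, but is not a Processor itself (tree-like hierarchy).
    class Pipeline {
    public:
        // Called only during initialisation, e.g. while parsing the XML config.
        void addProcessor(Processor* p) { processors_.push_back(p); }

        // Iteration loop: N dereferences and N virtual calls for N processors.
        std::string run(const std::string& input) const {
            std::string data(input);
            for (std::vector<Processor*>::size_type i = 0; i < processors_.size(); ++i)
                data = processors_[i]->process(data);
            return data;
        }

    private:
        std::vector<Processor*> processors_;  // fixed arrangement for the whole session
    };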
The source code organisation of the Pipeline Skeleton follows common FLOSS
conventions (an src folder, config, doc, README, HACKING, COPYING, etc.). It
only depends externally on the TinyXML++ (ticpp)
library for parsing the XML config files, and likewise it makes use of the
premake build script generation
tool; a rough parsing sketch is given below. I hope you enjoy it.
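For illustration, the following sketch shows how such a pipeline layout might be read with ticpp; the XML structure and the element and attribute names ("pipeline", "processor", "class") are my own assumptions for the example, not necessarily the actual config format of the package:

    #include <iostream>
    #include <string>
    #include "ticpp.h"

    // Assumed config layout (hypothetical):
    //   <pipeline>
    //     <processor class="UpperCaser"/>
    //     <processor class="Tokenizer"/>
    //   </pipeline>
    int main() {
        try {
            ticpp::Document doc("pipeline.xml");
            doc.LoadFile();
            ticpp::Element* pipeline = doc.FirstChildElement("pipeline");
            ticpp::Iterator<ticpp::Element> proc("processor");
            for (proc = proc.begin(pipeline); proc != proc.end(); proc++) {
                std::string cls;
                proc->GetAttribute("class", &cls);
                // A factory would instantiate the concrete Processor by name here.
                std::cout << "configured processor: " << cls << std::endl;
            }
        } catch (ticpp::Exception& ex) {
            std::cerr << ex.what() << std::endl;
            return 1;
        }
        return 0;
    }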
--
[Stroustrup, 1999] B. Stroustrup, "Learning Standard C++ as a New Language",
C/C++ Users Journal, pp. 43-54, May 1999