Post 44
Pipeline Skeleton released
02-Feb-2011
Regarding Bob Carpenter's ready-to-distribute
pyhi package, a
bare-bones Python package with all the trimmings (modular structure,
configurability, build automation, etc.), and his apparent recent
skew towards C/C++,
I thought it would be interesting and useful to have a similar
package in C++. Such a framework could be based on
a sequential processing structure, whose modules could be defined
(and redefined) in an external XML config file, and whose core implementation
could be abstract with regard
to concrete application needs (declaring pure virtual functions),
thus defining a neat interface ready to be extended
for any particular purpose. So, I have just released the Pipeline Skeleton
(see the CODE section of my homepage).
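To give a flavour of the design (a minimal sketch with made-up names, not necessarily the exact classes of the released package), the core boils down to an abstract processor interface declaring a pure virtual processing function, which concrete applications then extend:

    #include <cctype>
    #include <string>

    // Abstract interface: the framework core only knows about this contract.
    class Processor {
    public:
        virtual ~Processor() {}
        // Pure virtual function: concrete applications must implement it.
        virtual std::string process(const std::string& data) = 0;
    };

    // Illustrative extension for a particular purpose.
    class UpperCaser : public Processor {
    public:
        virtual std::string process(const std::string& data) {
            std::string out(data);
            for (std::string::size_type i = 0; i < out.size(); ++i)
                out[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i])));
            return out;
        }
    };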
The Pipeline Skeleton is intended to provide an adequate ground framework to
buttress a data processing project (e.g., in spoken language processing)
without compromising its future growth.
The main motivation for coding it stems from
Bjarne Stroustrup's claim (for example)
that "modularity is a fundamental aspect of all successful large programs".
The point is to avoid throwing away (and redoing from scratch)
pieces of code produced a while ago because their
original design did not consider any extensibility and/or reusability
aspects. Maintaining such awful code eventually becomes a waste of time,
not to mention the headache it causes when many programmers work
on the same project code.
Every time a critical variable is hardcoded,
a piece of code is left undocumented, or the like, the future of a program is jeopardised,
and sooner or later its developers will have to face these bad coding practices, unless the
program in question dies prematurely and nobody ever needs to run it again.
Unfortunately, this is the acknowledged style of scientific code (see
this and
this)
and we have to deal with it.
Why C++? Because of performance issues, mainly. Java (to name a most comparable
and extensively used OOP language, raising an eternal question) does perform swiftly with
a JIT compiler under certain circumstances. But when a real-time response is
required or pursued, dealing with large/huge amounts of data (e.g., a
Wikipedia dump or a several-hours-long speech corpus),
Java does not yet seem to yield an effectiveness comparable to that of a
natively compiled language like C++. C++ has always been meant for
high-performance applications. After all, powerful virtual machines like
OpenJDK's HotSpot and
LLVM are written in C/C++.
Why a sequential processing structure? Because many speech and language
processing applications rely on some sort of pipeline architecture,
e.g., see the Sphinx-4 FrontEnd,
which inspired the modular processing framework of the
EmoLib Affective Tagger.
Nevertheless, there is a design (and thus also implementation)
difference between these examples and the Pipeline Skeleton.
The former leave the data flow control
to the processors (i.e., the modules), as these are arranged in a linked list.
The latter, instead, builds an array containing all the processors
and iterates over them to process the data. This decision is
motivated by code simplicity (and Occam's Razor), given that the
arrangement of processors is set in the XML config file and
maintained throughout the processing session (at least this is the modus
operandi that I have always followed to organise and conduct my experiments).
Since no insertion and removal of processors is typically
allowed after the pipeline initialisation step,
there is no apparent need to keep the linked-list structure (anyway,
the std::vector class also allows such operations).
There is, though, some overhead introduced by the iteration loop (in addition to the common
N dereferences and N function calls for N processors),
but Stroustrup (1999) demonstrates
that the reduction in code complexity can be obtained without loss of efficiency
using the standard library.
Finally, the "cyclic" class hierarchy in the FrontEnd where the pipeline
extends the processor and also contains processors is
reorganised into a tree-like hierarchy for conceptual clarity.
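As a rough sketch of this arrangement (again with illustrative names, not necessarily those of the package), the pipeline owns a std::vector of processors, filled once at initialisation from the XML config, and simply iterates over it; note that the pipeline contains processors but does not derive from Processor, hence the tree-like hierarchy:

    #include <string>
    #include <vector>

    // Abstract interface as sketched above.
    class Processor {
    public:
        virtual ~Processor() {}
        virtual std::string process(const std::string& data) = 0;
    };

    // Contains processors, but is not a Processor itself (tree-like hierarchy).
    class Pipeline {
    public:
        // Called only during initialisation, e.g. while parsing the XML config.
        void addProcessor(Processor* p) { processors_.push_back(p); }

        // Iteration loop: N dereferences and N virtual calls for N processors.
        std::string run(const std::string& input) const {
            std::string data(input);
            for (std::vector<Processor*>::size_type i = 0; i < processors_.size(); ++i)
                data = processors_[i]->process(data);
            return data;
        }

    private:
        std::vector<Processor*> processors_;  // fixed arrangement for the whole session
    };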
The source code organisation of the Pipeline Skeleton follows common FLOSS
conventions (an src folder, config, doc, README, HACKING, COPYING, etc.). It
only depends externally on the TinyXML++ (ticpp)
library for parsing the XML config files, and likewise it makes use of the
premake build script generation
tool; a rough parsing sketch is given below. I hope you enjoy it.
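For illustration, the following sketch shows how such a pipeline layout might be read with ticpp; the XML structure and the element and attribute names ("pipeline", "processor", "class") are my own assumptions for the example, not necessarily the actual config format of the package:

    #include <iostream>
    #include <string>
    #include "ticpp.h"

    // Assumed config layout (hypothetical):
    //   <pipeline>
    //     <processor class="UpperCaser"/>
    //     <processor class="Tokenizer"/>
    //   </pipeline>
    int main() {
        try {
            ticpp::Document doc("pipeline.xml");
            doc.LoadFile();
            ticpp::Element* pipeline = doc.FirstChildElement("pipeline");
            ticpp::Iterator<ticpp::Element> proc("processor");
            for (proc = proc.begin(pipeline); proc != proc.end(); proc++) {
                std::string cls;
                proc->GetAttribute("class", &cls);
                // A factory would instantiate the concrete Processor by name here.
                std::cout << "configured processor: " << cls << std::endl;
            }
        } catch (ticpp::Exception& ex) {
            std::cerr << ex.what() << std::endl;
            return 1;
        }
        return 0;
    }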
--
[Stroustrup, 1999] B. Stroustrup, "Learning Standard C++ as a New Language",
C/C++ Users Journal, pp. 43-54, May 1999