Goose: The GNU Object-Oriented Statistics Environment
Goose is an LGPLed C++ library dedicated to statistical computation. The two
design goals of this project are:
- To create a useful and complete system that takes advantage of
C++'s features to improve the clarity of statistical code and
to make that code easier for programmers to use.
- To produce a complete set of Guile
bindings, exporting all of the C++ library's functionality to that
environment.
A third, ancillary goal is to provide statistical functionality for Guppi, the
Gnome plotting and data-visualization
program. The desire for non-trivial statistics in Guppi was actually the
motivation for creating Goose.
Goose is being developed primarily under GNU/Linux, but an effort is
being made to ensure that it is portable both to other Un*x systems
and to Win32.
You should be aware that Goose is still in the early stages of
development, and parts of it are prone to breakage,
bugginess, and sudden, sweeping API changes. This is alpha software.
Anyone who at this time wants to use Goose in a non-trivial way should
stay in touch with the developers via the mailing
list.
With that said, you should also know that the core parts of Goose are
relatively stable and debugged. Goose does have a reasonable number
of useful features, with more functionality being added all of the
time.
The current version of Goose is 0.0.11,
which was released on 18 Oct 1999. It can be
downloaded from
http://ftp.gnu.org/pub/gnu/goose.
(You might also want to use one of the
FSF mirror sites.)
A copy of the latest version can usually also be found at
ftp://ftp.gnome.org/pub/guppi.
RPMs are available from
http://ftp.gnu.org/pub/gnu/goose/RPMS.
Development versions of Goose currently live in the
Gnome Project's
CVS server.
Using the anonymous CVS server, just check out goose.
That server also has a nice mechanism for
browsing the cross-referenced
source code.
The following is a list of features in Goose that should (more or less)
work. Additional features may be available in the development version.
- Numerical functions that are useful for statistical computation, including:
- Combinatorial functions:
factorial, log factorial, binomial coefficient, log of binomial coefficient.
- CDF and inverse CDF functions for many common distributions,
including normal (Gaussian), binomial, negative binomial, beta,
chi-square, F, gamma, Poisson, hypergeometric, and Student's t.
- Other useful special functions: gamma function, log of gamma, incomplete
gamma function, log of incomplete gamma.
- A fast, high-quality Mersenne Twister-based random number generator.
- The RealSet class, an optimized container class for statistical data that offers:
- Copy-on-write semantics that allow large containers to be efficiently
copied and passed by value.
- Cached mean, standard deviation, minimum and maximum.
- Caching of a sorted version of the data, and automatic detection
of sorted data. All unnecessary sort operations are eliminated, and
the user never needs to worry about whether their data is sorted.
- Optimized data transformations (linear, exp, log, logit),
sorting, replacement of values by their ranks, rearrangement by
arbitrary permutations, and random re-ordering.
- Efficient calculation of descriptive statistics (many in constant
time): minimum, maximum, range, sum, mean, variance, standard
deviation, sample standard deviation, percentile, median, quartiles,
interquartile range, deciles, trimmed mean, winsorized mean, arbitrary
moments, geometric mean, harmonic mean, RMS, mean deviation, median
deviation, kurtosis, skewness, Durbin-Watson, autocorrelation.
- Descriptive statistics involving two variables or data sets:
covariance, correlation, Spearman's rho, Kendall's tau,
pooled mean, pooled variance, weighted mean.
- Calculations on empirical distribution functions:
Kolmogorov-Smirnov D, D+, D-, Kuiper's V.
- Statistical tests: t-test, F-test, Kruskal-Wallis, Spearman, McNemar,
Cochran's Q.
- An implementation of simple linear regression that includes:
- Calculation of confidence intervals for the slope and intercept.
- t- and p-values for the model.
- Pointwise diagnostics: leverage, DFBETAS, DFFITS, and Cook's D.
- Optimized, optionally multi-threaded resampling routines for
bootstrapping the mean, median, standard deviation, skewness,
kurtosis, or the slope and intercept of a simple linear regression.
- Kernel density estimation using Epanechnikov, Biweight,
Triweight, Gaussian and Uniform kernels.
- An "automagical" ASCII import system that can analyze and make
intelligent guesses about the format/layout of text files containing
numeric data.
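
To illustrate the copy-on-write and cached-statistics ideas listed for
RealSet above, here is a minimal, single-threaded sketch. It is not
Goose's actual API: the class name CowSeries and all of its members are
hypothetical, invented only for this example, and only the mean is
cached.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is NOT Goose's RealSet interface.
class CowSeries {
public:
    CowSeries() : rep_(std::make_shared<Rep>()) {}
    explicit CowSeries(std::vector<double> values)
        : rep_(std::make_shared<Rep>(std::move(values))) {}

    // The implicit copy constructor and assignment only copy the
    // shared_ptr, so copying a large series costs O(1).

    std::size_t size() const { return rep_->data.size(); }

    // Mutators detach first, so other copies never observe the change.
    void add(double x) {
        detach();
        rep_->data.push_back(x);
        rep_->mean_valid = false;  // invalidate the cached mean
    }

    // The mean is computed on first request and reused until the data changes.
    double mean() const {
        assert(!rep_->data.empty());
        if (!rep_->mean_valid) {
            const double sum =
                std::accumulate(rep_->data.begin(), rep_->data.end(), 0.0);
            rep_->mean_cache = sum / static_cast<double>(rep_->data.size());
            rep_->mean_valid = true;
        }
        return rep_->mean_cache;
    }

    // True while two series still refer to the same underlying buffer.
    bool shares_storage_with(const CowSeries& other) const {
        return rep_ == other.rep_;
    }

private:
    struct Rep {
        Rep() = default;
        explicit Rep(std::vector<double> v) : data(std::move(v)) {}
        std::vector<double> data;
        bool mean_valid = false;
        double mean_cache = 0.0;
    };

    // Make a private copy of the buffer if anyone else is sharing it.
    // (Not thread-safe; a real library would need synchronization.)
    void detach() {
        if (rep_.use_count() > 1) rep_ = std::make_shared<Rep>(*rep_);
    }

    std::shared_ptr<Rep> rep_;
};
```

In this sketch, copying a CowSeries and passing it by value is cheap
because the buffer is duplicated only when one of the copies calls
add(). The same pattern could be extended to cache other statistics,
such as the minimum, maximum, or a sorted copy of the data.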
Goose's Guile bindings currently give access to:
- Most of the numerical functions.
- The random number generator.
- Pretty much all of the RealSet's functionality.
- The basics of simple linear regression.
The current "official" forum for discussing Goose is the
guppi-list
mailing list. (We still share a pretty low-traffic mailing list with Guppi.)
Subscription requests should be sent to
guppi-list-request@gnome.org.
Questions and comments can also be sent to
Jon Trowbridge
<trow@gnu.org>.
Goose is mainly being coded by
Jon Trowbridge
<trow@gnu.org>,
but not without a significant amount
of help from other dutiful programmers (in alphabetical order):
- Bradford Hovinen (Guile Extensions, Hypothesis Testing)
- Asger Alstrup Nielsen (Infrastructure, ASCII import)
- Havoc Pennington (General Hacking, Autoconf magic, Aura of Coolness)
- Mikkel Munck Rasmussen (Statistical tests)
Goose is just one of the GNU projects that involve statistical computation,
and may not be the right tool for your job.
Other useful GNU tools include:
- R, an S clone, is a
system for statistical computation and graphics.
- PSPP
(previously known as Fiasco) is an SPSS clone.
It interprets commands in the
SPSS language and produces tabular output in
ASCII, HTML, or PostScript format.
- GSL, the GNU Scientific Library,
is a collection of routines for numerical computing. The
routines are written from scratch by the GSL team in ANSI C, and are
meant to present a modern Applications Programming Interface (API) for
C programmers, while allowing wrappers to be written for very high
level languages. It contains some support for statistical functions.
- GNU Octave
is a high-level language, primarily intended
for numerical computations, that provides a convenient command line
interface for solving linear and nonlinear problems numerically.
It also offers some limited statistical functionality.
If you are aware of any other good free statistics tools that I've
omitted, please
e-mail me so that I can
add them to the list.
Please send FSF & GNU inquiries & questions to
gnu@gnu.org.
There are also other ways to
contact the FSF.
Please send comments on these web pages to
webmasters@www.gnu.org.
Copyright (C) 1999 Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111, USA
Verbatim copying and distribution of this entire article is
permitted in any medium, provided this notice is preserved.
Updated:
19 Oct 1999 trow