[Fwd: [TCLCORE] Ropes]
François Vogel
fvogelnew1 at free.fr
Mon Jan 12 20:59:55 CET 2009
FYI.
For potential consideration of the idea, or not, for Scilab 6.
Francois
-------- Message original --------
Sujet : [TCLCORE] Ropes
Date : Mon, 12 Jan 2009 16:39:17 +0100 (CET)
De : fredericbonnet at free.fr
Pour : tcl9-cloverfield at lists.sourceforge.net,
tcl-core at lists.sourceforge.net
Hi all,
First, happy new year!
Following the discussions we've had a while back about string
representations, Unicode, Tcl9, Cloverfield and the like, I've been
working during the past weeks on a rope package. You can find it here
on the Tcl9 project on Sourceforge:
http://sourceforge.net/project/showfiles.php?group_id=216988
The implementation is a materialization of several ideas I've
developed over the years, with some borrowed from the seminal paper on
Cedar Ropes by Hans Boehm et. al., available here:
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf
Ropes are string structures where data is not stored in flat
NUL-terminated arrays but in self-balancing binary trees, allowing for
fast insertion/removal of arbitrarily long strings. The package is
built around a dedicated memory allocator based on fixed-size cells,
coupled with a generational, exact, mark-and-sweep garbage collector.
Data structures:
================
Ropes are made of chunks of Unicode string data that can use either
fixed-width formats (native C strings, UCS 1, 2 or 4 bytes) or UTF-8.
Such string data chunks can take a variable number of cells; if the
provided data is larger than the maximum size (here 63 cells, i.e.
1008 bytes minus the 4 or 8 byte header), it is split into several
chunks and assembled transparently. Ropes can be made of chunks having
different representations, thus allowing maximum compacity when mixing
e.g. typically 7-bit Tcl code and foreign language data. Moreover,
native NUL-terminated C strings are recognized as valid ropes from C
code.
Basic string chunks are assembled by concatenation to form larger
ropes. Small strings are transparently merged into flat leaves,
whereas larger ones use a self-balancing binary tree made of concat
nodes. Substrings can also be extracted, and form either flat leaves
for the smallest ones, or use a substr node for larger ones. The
combination of both techniques allows for easy handling of arbitrarily
large strings with minimal duplication and maximal sharing of raw
string data.
Apart from strings, all nodes are designed to fit into one single
cell, thus providing maximum allocation performances and minimal
memory impact.
Indexing is O(1) for flat fixed-width ropes, O(n) for UTF-8 (as with
Tcl), and O(logn) in general for complex ropes. The fact that basic
string chunk size is limited also means that very large UTF-8 strings
have better performances than flat UTF-8 strings (such as those used
by Tcl) thanks to the intermediary indexing levels.
Memory management:
==================
The dedicated memory allocator is based on fixed-size cells (16 bytes
on 32-bit architectures) within page-aligned memory. Each page
contains a bitmask that indicates the cell status. For 1024-byte
pages, this gives 64 16-byte cells, of which one is reserved. This
results in a very small overhead (2 bit per cell) and a very good
memory locality, both improving the performances dramatically (more on
that below) on modern architectures compared to a traditional
allocator (which typically uses linked list structures). This choice
has been made because, with the exception of pure string data, rope
nodes are typically small and easily fit within a 16-byte cell.
Moreover, this allocator is coupled with a generational, exact (as
opposed to conservative), mark-and-sweep garbage collector that again
provides a huge speedup compared to manual free() calls. The only
needed action is root declaration for all ropes that are externally
referenced, using a simple reference counting scheme. The GC process
is fully controllable in the sense that sections of code can be
protected by pausing and resuming the automatic collection.
Generational GC means that older ropes (having survived several GC
cycles) are promoted to older pools that are collected less
frequently; this limits the CPU impact of collections.
Other features:
===============
The package provides iterator structures and procedures, so porting
existing code should be easy. I'm thinking especially about the regexp
engine, which needs flat Unicode strings. Iterators abstract the whole
stuff into a random-access model. Direct traversal of the string is O(1).
A custom rope is available for extensions. Typical use would be for
example memory-mapping of potentially very large datasets, even on
memory-constrained systems (e.g. mobile platforms). Another use case
would be programmatically generated data. Or it could be used to wrap
malloc'd strings into ropes.
Connections with Tcl:
=====================
Tcl currently uses UTF-8 flat strings as its string representation.
There have been some discussions about the format that future versions
should use. One such proposal was augmented strings, where flat
strings would be supplemented by additional information (indexing of
UTF-8 strings for instance). Another concern was memory consumption
needed for the support of UCS4 (32-bit chars).
I think ropes provides a good solution to all this problems. Ropes are
immutable strings, which fit the Tcl model perfectly. One can mix
several formats transparently, making byte arrays unnecessary and
removing a prominent cause of shimmering. And as complex types such as
lists build their string representations from those of their elements,
maximum reuse of existing strings is ensured and limits the memory
impact even further. Last, the impact on client code is minimal thanks
to the automatic garbage collector and the backward compatibility with
C strings. For these reasons I think that ropes would be a good choice
as a native string representation for future versions of Tcl.
Performances:
=============
Of course there are lies, damn lies and benchmarks, but the first
performance test I've run on my system are very satisfactory.
On a Core 2 Duo P8400 2.26GHz WinXP SP3 with 2GB RAM I get the
following results (from test.c):
(figures in ms)
---------------------------------------------------------------------
testAlloc: ropes vs. malloc raw allocation performances
Ropes: 40000000 12-byte ropes = 480000000 data bytes ...
1859 create + 610 GC = 2469
malloc: 40000000 12-byte C strings = 480000000 data bytes ...
6718 malloc + memcpy + 7922 free = 14640
Ropes: 20000000 28-byte ropes = 560000000 data bytes ...
1453 create + 625 GC = 2078
malloc: 20000000 28-byte C strings = 560000000 data bytes ...
3813 malloc + memcpy + 3828 free = 7641
Ropes: 15000000 44-byte ropes = 660000000 data bytes ...
1375 create + 703 GC = 2078
malloc: 15000000 44-byte C strings = 660000000 data bytes ...
3016 malloc + memcpy + 2953 free = 5969*
Ropes: 1000000 1000-byte ropes = 1000000000 data bytes ...
1062 create + 1000 GC = 2062
malloc: 1000000 1000-byte C strings = 1000000000 data bytes ...
1047 malloc + memcpy + 391 free = 1438
---------------------------------------------------------------------
This shows that the allocator+GC performs usually faster than
malloc+free, mostly because of the GC performances compared to free().
The malloc version outperforms the ropes only in the case of a large
number of large strings. So this benchmark shows the real benefits of
automatic memory management even on the simplest cases; in the general
real-world case where a lot of small structure are allocated and freed
during the lifetime of the application, the malloc version would
perform closer to the worst case above, and maybe worse because of
memory fragmentation.
The following test is closer to a real world application: it runs
several (10,000) cycles during which 80 large strings are allocated
and preserved.
---------------------------------------------------------------------
testGeneration:
With all ropes preserved:
10000 x 80 988-byte ropes + roots = 790400000 data bytes : 16953
With no more than 10000 ropes preserved:
10000 x 80 988-byte ropes + roots = 790400000 data bytes : 4406
---------------------------------------------------------------------
This shows the generational properties of the GC: older ropes will be
promoted to older pools and thus traversed less often by the collector.
A real world application during its lifetime would typically store a
fairly constant number of stable objects (global or static data,
business models) and allocate a larger number of short-lived objects
(input, output and temporary values). Generational GC ensures that the
latter get collected more often than the former, and let stable
objects percolate to deeper layers.
Conclusion:
===========
The package needs some polish (test suite, docs, etc.) but I think the
code is fairly usable in its current state. At present it only works
on 32-bit architectures, but a port to 64-bit should be
straightforward. Moreover, as the custom allocator needs page-aligned
memory, I've only implemented a Win32 version based on VirtualAlloc().
Posix systems would need posix_memalign() or something more suitable
to get the same result (it uses a 1024-byte boundary on 32-bit systems).
I plan to design a similar package but for objects, especially lists
since they are very similar to strings, and integrate both closely.
Comments welcome!
------------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It is the best place to buy or sell services for
just about anything Open Source.
http://p.sf.net/sfu/Xq1LFB
_______________________________________________
Tcl-Core mailing list
Tcl-Core at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core
More information about the dev
mailing list