[Fwd: [TCLCORE] Ropes]

Mon Jan 12 20:59:55 CET 2009

FYI.

For potential consideration of the idea, or not, for Scilab 6.

Francois

-------- Message original --------
Sujet : [TCLCORE] Ropes
Date : Mon, 12 Jan 2009 16:39:17 +0100 (CET)
De : fredericbonnet at free.fr
Pour : tcl9-cloverfield at lists.sourceforge.net, 
tcl-core at lists.sourceforge.net

Hi all,

First, happy new year!

Following the discussions we've had a while back about string 
representations, Unicode, Tcl9, Cloverfield and the like, I've been 
working during the past weeks on a rope package. You can find it here 
on the Tcl9 project on Sourceforge:

http://sourceforge.net/project/showfiles.php?group_id=216988

The implementation is a materialization of several ideas I've 
developed over the years, with some borrowed from the seminal paper on 
Cedar Ropes by Hans Boehm et. al., available here:

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

Ropes are string structures where data is not stored in flat 
NUL-terminated arrays but in self-balancing binary trees, allowing for 
fast insertion/removal of arbitrarily long strings. The package is 
built around a dedicated memory allocator based on fixed-size cells, 
coupled with a generational, exact, mark-and-sweep garbage collector.

Data structures:
================
Ropes are made of chunks of Unicode string data that can use either 
fixed-width formats (native C strings, UCS 1, 2 or 4 bytes) or UTF-8. 
Such string data chunks can take a variable number of cells; if the 
provided data is larger than the maximum size (here 63 cells, i.e. 
1008 bytes minus the 4 or 8 byte header), it is split into several 
chunks and assembled transparently. Ropes can be made of chunks having 
different representations, thus allowing maximum compacity when mixing 
e.g. typically 7-bit Tcl code and foreign language data. Moreover, 
native NUL-terminated C strings are recognized as valid ropes from C 
code.

Basic string chunks are assembled by concatenation to form larger 
ropes. Small strings are transparently merged into flat leaves, 
whereas larger ones use a self-balancing binary tree made of concat 
nodes. Substrings can also be extracted, and form either flat leaves 
for the smallest ones, or use a substr node for larger ones. The 
combination of both techniques allows for easy handling of arbitrarily 
large strings with minimal duplication and maximal sharing of raw 
string data.

Apart from strings, all nodes are designed to fit into one single 
cell, thus providing maximum allocation performances and minimal 
memory impact.

Indexing is O(1) for flat fixed-width ropes, O(n) for UTF-8 (as with 
Tcl), and O(logn) in general for complex ropes. The fact that basic 
string chunk size is limited also means that very large UTF-8 strings 
have better performances than flat UTF-8 strings (such as those used 
by Tcl) thanks to the intermediary indexing levels.

Memory management:
==================
The dedicated memory allocator is based on fixed-size cells (16 bytes 
on 32-bit architectures) within page-aligned memory. Each page 
contains a bitmask that indicates the cell status. For 1024-byte 
pages, this gives 64 16-byte cells, of which one is reserved. This 
results in a very small overhead (2 bit per cell) and a very good 
memory locality, both improving the performances dramatically (more on 
that below) on modern architectures compared to a traditional 
allocator (which typically uses linked list structures). This choice 
has been made because, with the exception of pure string data, rope 
nodes are typically small and easily fit within a 16-byte cell.

Moreover, this allocator is coupled with a generational, exact (as 
opposed to conservative), mark-and-sweep garbage collector that again 
provides a huge speedup compared to manual free() calls. The only 
needed action is root declaration for all ropes that are externally 
referenced, using a simple reference counting scheme. The GC process 
is fully controllable in the sense that sections of code can be 
protected by pausing and resuming the automatic collection. 
Generational GC means that older ropes (having survived several GC 
cycles) are promoted to older pools that are collected less 
frequently; this limits the CPU impact of collections.

Other features:
===============
The package provides iterator structures and procedures, so porting 
existing code should be easy. I'm thinking especially about the regexp 
engine, which needs flat Unicode strings. Iterators abstract the whole 
stuff into a random-access model. Direct traversal of the string is O(1).

A custom rope is available for extensions. Typical use would be for 
example memory-mapping of potentially very large datasets, even on 
memory-constrained systems (e.g. mobile platforms). Another use case 
would be programmatically generated data. Or it could be used to wrap 
malloc'd strings into ropes.

Connections with Tcl:
=====================
Tcl currently uses UTF-8 flat strings as its string representation. 
There have been some discussions about the format that future versions 
should use. One such proposal was augmented strings, where flat 
strings would be supplemented by additional information (indexing of 
UTF-8 strings for instance). Another concern was memory consumption 
needed for the support of UCS4 (32-bit chars).

I think ropes provides a good solution to all this problems. Ropes are 
immutable strings, which fit the Tcl model perfectly. One can mix 
several formats transparently, making byte arrays unnecessary and 
removing a prominent cause of shimmering. And as complex types such as 
lists build their string representations from those of their elements, 
maximum reuse of existing strings is ensured and limits the memory 
impact even further. Last, the impact on client code is minimal thanks 
to the automatic garbage collector and the backward compatibility with 
C strings. For these reasons I think that ropes would be a good choice 
as a native string representation for future versions of Tcl.

Performances:
=============
Of course there are lies, damn lies and benchmarks, but the first 
performance test I've run on my system are very satisfactory.

On a Core 2 Duo P8400 2.26GHz WinXP SP3 with 2GB RAM I get the 
following results (from test.c):
(figures in ms)

---------------------------------------------------------------------
testAlloc: ropes vs. malloc raw allocation performances

Ropes: 40000000 12-byte ropes = 480000000 data bytes ...
         1859 create + 610 GC = 2469
malloc: 40000000 12-byte C strings = 480000000 data bytes ...
         6718 malloc + memcpy + 7922 free = 14640

Ropes: 20000000 28-byte ropes = 560000000 data bytes ...
         1453 create + 625 GC = 2078
malloc: 20000000 28-byte C strings = 560000000 data bytes ...
         3813 malloc + memcpy + 3828 free = 7641

Ropes: 15000000 44-byte ropes = 660000000 data bytes ...
         1375 create + 703 GC = 2078
malloc: 15000000 44-byte C strings = 660000000 data bytes ...
         3016 malloc + memcpy + 2953 free = 5969*

Ropes: 1000000 1000-byte ropes = 1000000000 data bytes ...
         1062 create + 1000 GC = 2062
malloc: 1000000 1000-byte C strings = 1000000000 data bytes ...
         1047 malloc + memcpy + 391 free = 1438
---------------------------------------------------------------------

This shows that the allocator+GC performs usually faster than 
malloc+free, mostly because of the GC performances compared to free(). 
The malloc version outperforms the ropes only in the case of a large 
number of large strings. So this benchmark shows the real benefits of 
automatic memory management even on the simplest cases; in the general 
real-world case where a lot of small structure are allocated and freed 
during the lifetime of the application, the malloc version would 
perform closer to the worst case above, and maybe worse because of 
memory fragmentation.

The following test is closer to a real world application: it runs 
several (10,000) cycles during which 80 large strings are allocated 
and preserved.

---------------------------------------------------------------------
testGeneration:

With all ropes preserved:
10000 x 80 988-byte ropes + roots = 790400000 data bytes : 16953

With no more than 10000 ropes preserved:
10000 x 80 988-byte ropes + roots = 790400000 data bytes : 4406
---------------------------------------------------------------------

This shows the generational properties of the GC: older ropes will be 
promoted to older pools and thus traversed less often by the collector.

A real world application during its lifetime would typically store a 
fairly constant number of stable objects (global or static data, 
business models) and allocate a larger number of short-lived objects 
(input, output and temporary values). Generational GC ensures that the 
latter get collected more often than the former, and let stable 
objects percolate to deeper layers.

Conclusion:
===========

The package needs some polish (test suite, docs, etc.) but I think the 
code is fairly usable in its current state. At present it only works 
on 32-bit architectures, but a port to 64-bit should be 
straightforward. Moreover, as the custom allocator needs page-aligned 
memory, I've only implemented a Win32 version based on VirtualAlloc(). 
Posix systems would need posix_memalign() or something more suitable 
to get the same result (it uses a 1024-byte boundary on 32-bit systems).

I plan to design a similar package but for objects, especially lists 
since they are very similar to strings, and integrate both closely.

Comments welcome!

------------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It is the best place to buy or sell services for
just about anything Open Source.
http://p.sf.net/sfu/Xq1LFB
_______________________________________________
Tcl-Core mailing list
Tcl-Core at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core