Subject: improved string performance for Tcl 8.1.2 - DN [1]
hershey@scriptics.com - 16 Jun 1999 - comp.lang.tcl
We recently made several improvements to the "String" object type in
core Tcl (8.1.2). The goal of these changes is to get the 8.1 string
operations to be as fast as they were in 8.0. Although 8.1.2 is not yet
released, you can access these changes via cvs repository at
www.scriptics.com
With the new "String" internal rep, strings are stored in either their
Unicode or UTF format. The default behavior is to store UTF. Once the
Unicode format is needed by a function, it is calculated and stored in
the internal rep for future access (without an additional O(n) cost).
A Unicode string is an internationalized string. Conceptually, a
Unicode string is an array of 16-bit quantities organized as a sequence
of properly formed UTF-8 characters. There is a one-to-one map between
Unicode and UTF characters. Because Unicode characters have a fixed
width, operations such as indexing operate on Unicode data. The String
ojbect is opitmized for the case where each UTF char in a string is only
one byte. In this case, we store the value of numChars, but we don't
store the Unicode data (unless Tcl_GetUnicode is explicitly called).
The following structure is the internal rep for a String object. It
keeps track of how much memory has been used and how much has been
allocated for the Unicode and UTF string to enable growing and shrinking
of the UTF and Unicode reps of the String object with fewer mallocs. To
optimize string length and indexing operations, this structure also
stores the number of characters (same of UTF and Unicode!) once that
value has been computed.
typedef struct String {
int numChars; /* The number of chars in the string.
* -1 means this value has not been
* calculated. >= 0 means that there is a
* valid Unicode rep, or that the number
* of UTF bytes = the number of chars. */
size_t allocated; /* The amount of space actually
* allocated for the UTF string, minus
* 1 byte for the termination char. */
size_t uallocated;/* The amount of space actually
* allocated for the Unicode string,
* minus 1 Tcl_UniChar for the
* termination char. 0 means the
* Unicode string rep is invalid. */
Tcl_UniChar unicode[2]; /* The array of Unicode chars.
* The actual size of this field
* depends on the "uallocated"
* field above. */
} String;
Here's a quick benchmark test that shows the relative speeds of Tcl
8.0.5, 8.1.1, and 8.1.2, which is not yet released. I ran the benchmark
tests on a Sun Ultra 5_10, solaris-sparc 5.6.
test definition
---- ----------
length: string length $str
index: string index $str $i
range: string range $str $endMinus10 end
append: append str A
appUtf: append str ü
reverse: reverse $str
proc reverse {s} {
set t ""
set len [string length $s]
for {incr len -1} {$len>=0} {incr len -1} {
append t [string index $s $len]
}
return $t
}
size means the string size = n * 26
byte means all chars are 1-byte long
utf means there are 1/5 chars are multi-byte
test size alpha 8.0.5 8.1.2 8.1.1
---- ---- ----- ----- ----- -----
length 1 byte 16312 14818 16793
length 1 utf 15944 15440 17406
length 10 byte 15946 15971 29810
length 10 utf 15938 15079 36707
length 100 byte 16007 14478 150522
length 100 utf 15966 15543 151967
index 1 byte 16661 17647 19267
index 1 utf 16896 16934 19421
index 10 byte 18206 19298 33765
index 10 utf 17840 19142 34720
index 100 byte 21687 22634 55088
index 100 utf 22248 23021 58401
append 1 byte 15901 17863 14268
append 1 utf 16759 16597 13977
append 10 byte 17623 16215 13967
append 10 utf 15778 16624 13966
append 100 byte 15781 18083 15738
append 100 utf 15850 17443 13915
appUtf 1 byte 15750 21064 14510
appUtf 1 utf 15825 19643 14450
appUtf 10 byte 15704 19175 15207
appUtf 10 utf 18339 20336 14466
appUtf 100 byte 16449 21366 14523
appUtf 100 utf 15808 19587 14566
fixRng 1 byte 28300 32145 27828
fixRng 1 utf 26901 28760 28197
fixRng 10 byte 28733 30888 46724
fixRng 10 utf 27839 27748 49368
fixRng 100 byte 26608 31740 207401
fixRng 100 utf 29258 30308 336874
longRng 1 byte 23503 30101 27760
longRng 1 utf 23613 27191 27408
longRng 10 byte 25734 33608 46357
longRng 10 utf 25579 33699 51651
longRng 100 byte 43919 53221 221372
longRng 100 utf 45265 66713 296995
reverse 1 byte 574 485 643
reverse 1 utf 573 569 630
reverse 10 byte 5118 5349 9503
reverse 10 utf 5133 5037 13393
reverse 100 byte 52032 57165 359724
reverse 100 utf 52036 51633 315730
We also added several new functions that operate on Unicode to the
public API:
Tcl_Obj *
Tcl_NewUnicodeObj(unicode, numChars)
void
Tcl_SetUnicodeObj(objPtr, unicode, numChars)
Tcl_UniChar *
Tcl_GetUnicode(objPtr)
Tcl_UniChar
Tcl_GetUniChar(objPtr, index)
int
Tcl_GetCharLength(objPtr)
Tcl_Obj *
Tcl_GetRange(objPtr, first, last)
void
Tcl_AppendUnicodeToObj(objPtr, unicode, numChars)
Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an
existing object to hold a copy of the Unicode string given by unicode
and numChars.
Tcl_GetUnicode returns an object's value as a Unicode string.
Tcl_GetUniChar returns the index'th character in the object's Unicode
representation. Tcl_GetRange returns a newly created object comprised
of the characters between first and last (inclusive) in the object's
Unicode representation. If the object's Unicode representation is
invalid, the Unicode representation is regenerated from the object's
string representation.
Tcl_AppendUnicodeToObj appends the Unicode string given by unicode and
numChars to the object specified by objPtr. If the object has an
invalid Unicode representation, then unicode is converted to the UTF
format and appended to the object's string representation. Appends are
optimized to handle repeated appends relatively efficiently (it
overallocates the string or Unicode space to avoid repeated
reallocations and copies of object's string value).
--
Melissa Hirschl <hershey@scriptics.com>
Software Engineer http://www.scriptics.com
Scriptics Corporation: The Tcl Platform Company
Last modified
1999-09-27
1999-09-27
(195.108.246.50)
Note: you are looking at
the snapshot of an old wiki
- much of this information
is likely to be very outdated
