Subject: improved string performance for Tcl 8.1.2 - DN [1]


hershey@scriptics.com - 16 Jun 1999 - comp.lang.tcl

 We recently made several improvements to the "String" object type in
 core Tcl (8.1.2).  The goal of these changes is to get the 8.1 string
 operations to be as fast as they were in 8.0.  Although 8.1.2 is not yet
 released, you can access these changes via cvs repository at
 www.scriptics.com

 With the new "String" internal rep, strings are stored in either their
 Unicode or UTF format.  The default behavior is to store UTF.  Once the
 Unicode format is needed by a function, it is calculated and stored in
 the internal rep for future access (without an additional O(n) cost).

 A Unicode string is an internationalized string.  Conceptually, a
 Unicode string is an array of 16-bit quantities organized as a sequence
 of properly formed UTF-8 characters.  There is a one-to-one map between
 Unicode and UTF characters.  Because Unicode characters have a fixed
 width, operations such as indexing operate on Unicode data.  The String
 ojbect is opitmized for the case where each UTF char in a string is only
 one byte.  In this case, we store the value of numChars, but we don't
 store the Unicode data (unless Tcl_GetUnicode is explicitly called).

 The following structure is the internal rep for a String object.  It
 keeps track of how much memory has been used and how much has been
 allocated for the Unicode and UTF string to enable growing and shrinking
 of the UTF and Unicode reps of the String object with fewer mallocs.  To
 optimize string length and indexing operations, this structure also
 stores the number of characters (same of UTF and Unicode!) once that
 value has been computed.

 typedef struct String {
     int numChars;  /* The number of chars in the string.
             * -1 means this value has not been
             * calculated. >= 0 means that there is a
             * valid Unicode rep, or that the number
             * of UTF bytes = the number of chars. */
     size_t allocated; /* The amount of space actually
                * allocated for the UTF string, minus
                * 1 byte for the termination char. */
     size_t uallocated;/* The amount of space actually
                * allocated for the Unicode string,
                * minus 1 Tcl_UniChar for the
                * termination char. 0 means the
                * Unicode string rep is invalid. */
     Tcl_UniChar unicode[2]; /* The array of Unicode chars.
                  * The actual size of this field
                  * depends on the "uallocated"
                  * field above. */
 } String;

 Here's a quick benchmark test that shows the relative speeds of Tcl
 8.0.5, 8.1.1, and 8.1.2, which is not yet released.  I ran the benchmark
 tests on a Sun Ultra 5_10, solaris-sparc 5.6.

 test       definition
 ----       ----------
 length:    string length $str
 index:     string index $str $i
 range:       string range $str $endMinus10 end
 append:    append str A
 appUtf:    append str ü
 reverse:   reverse $str

     proc reverse {s} {
         set t ""
         set len [string length $s]
         for {incr len -1} {$len>=0} {incr len -1} {
         append t [string index $s $len]
         }
         return $t
     }

 size means the string size = n * 26
 byte means all chars are 1-byte long
 utf means there are 1/5 chars are multi-byte

 test       size    alpha    8.0.5      8.1.2      8.1.1
 ----    ----    -----    -----     -----   -----
 length  1       byte    16312      14818      16793
 length  1       utf     15944      15440      17406
 length  10      byte    15946      15971      29810
 length  10      utf     15938      15079      36707
 length  100     byte    16007      14478      150522
 length  100     utf     15966      15543      151967

 index   1       byte    16661      17647      19267
 index   1       utf     16896      16934      19421
 index   10      byte    18206      19298      33765
 index   10      utf     17840      19142      34720
 index   100     byte    21687      22634      55088
 index   100     utf     22248      23021      58401

 append  1       byte    15901      17863      14268
 append  1       utf     16759      16597      13977
 append  10      byte    17623      16215      13967
 append  10      utf     15778      16624      13966
 append  100     byte    15781      18083      15738
 append  100     utf     15850      17443      13915

 appUtf  1       byte    15750      21064      14510
 appUtf  1       utf     15825      19643      14450
 appUtf  10      byte    15704      19175      15207
 appUtf  10      utf     18339      20336      14466
 appUtf  100     byte    16449      21366      14523
 appUtf  100     utf     15808      19587      14566

 fixRng  1       byte    28300      32145      27828
 fixRng  1       utf     26901      28760      28197
 fixRng  10      byte    28733      30888      46724
 fixRng  10      utf     27839      27748      49368
 fixRng  100     byte    26608      31740      207401
 fixRng  100     utf     29258      30308      336874

 longRng 1       byte    23503      30101      27760
 longRng 1       utf     23613      27191      27408
 longRng 10      byte    25734      33608      46357
 longRng 10      utf     25579      33699      51651
 longRng 100     byte    43919      53221      221372
 longRng 100     utf     45265      66713      296995

 reverse 1       byte    574      485      643
 reverse 1       utf     573      569      630
 reverse 10      byte    5118      5349      9503
 reverse 10      utf     5133      5037      13393
 reverse 100     byte    52032      57165      359724
 reverse 100     utf     52036      51633      315730

 We also added several new functions that operate on Unicode to the
 public API:

     Tcl_Obj *
     Tcl_NewUnicodeObj(unicode, numChars)

     void
     Tcl_SetUnicodeObj(objPtr, unicode, numChars)

     Tcl_UniChar *
     Tcl_GetUnicode(objPtr)

     Tcl_UniChar
 Tcl_GetUniChar(objPtr, index)

     int
 Tcl_GetCharLength(objPtr)

     Tcl_Obj *
 Tcl_GetRange(objPtr, first, last)

     void
 Tcl_AppendUnicodeToObj(objPtr, unicode, numChars)

 Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an
 existing object to hold a copy of the Unicode string given by unicode
 and numChars.

 Tcl_GetUnicode returns an object's value as a Unicode string.
 Tcl_GetUniChar returns the index'th character in the object's Unicode
 representation.  Tcl_GetRange returns a newly created object  comprised
 of the characters between first and last (inclusive) in the object's
 Unicode representation.  If the object's Unicode representation is
 invalid, the Unicode representation is regenerated from the object's
 string representation.

 Tcl_AppendUnicodeToObj appends the Unicode string given by unicode and
 numChars to the object specified by objPtr.  If the object has an
 invalid Unicode representation, then unicode is converted to the UTF
 format and appended to the object's string representation.  Appends are
 optimized  to handle repeated appends relatively efficiently (it
 overallocates the string or Unicode space to avoid repeated
 reallocations and copies of object's string value).

 --
 Melissa Hirschl         <hershey@scriptics.com>
 Software Engineer       http://www.scriptics.com
 Scriptics Corporation:  The Tcl Platform Company

Last modified
1999-09-27

(195.108.246.50)

Note: you are looking at
the snapshot of an old wiki
- much of this information
is likely to be very outdated