Subject: Re: Speeding up Unicode (WAS Re: 8.1 slower then 8.0???) - DN [1]


Paul Duffin <pduffin@mailserver.hursley.ibm.com> - 26 May 1999 - comp.lang.tcl

 Paul Duffin wrote:
 >
 > Scott Stanton wrote:
 > >
 > > "Donal K. Fellows" wrote:
 > > > I really don't think it is as bad a problem as some people make out,
 > > > especially since I was only ever thinking of using the Unicode rep.
 > > > for stuff like the [string] subcommands.  I think this calls for
 > > > someone to actually implement this stuff and see how it affects code
 > > > in practise.  I don't think that most normal code will take much of a
 > > > hit, and code which does repeated string/RE ops (fairly typically
 > > > these are repeated when they are done at all) will gain.
 > >
 > > We are in the process of implementing the Unicode object type.  We plan to make
 > > the string subcommands, regexp, regsub, and append all be aware of the new
 > > type.  So far we have string index and string length working, and we're
 > > currently working on string range.  In addition, the string subcommand will be
 > > aware of the ByteArray type to improve the performance of binary data
 > > operations.  This implementation gets us back to approximately 8.0 levels of
 > > performance.  I expect that the effect of object type shimmering will be fairly
 > > minor, since it is not all that common to use string operations on other data
 > > types in ways that don't result in new objects being created anyway.
 > >
 >
 > When you say object type shimmering I presume you are talking about
 > conversions between different types.
 >
 > I hope that behind your statement lies a smart implementation of
 > [string length/index/range] which only converts its object argument
 > to a Unicode string object if the object is untyped. Otherwise you will
 > cause a LOT of extra conversions in some applications.
 >
 > e.g. the following code should not result in the internal list
 > representation being thrown away but rather should take the hit of
 > working with UTF-8.
 >
 >         set list [list 1 2 3 4 5 6]
 >         set length [string length $list]
 >         set string1 [string index $list 0]
 >         set string2 [string range $list 2 4]
 >
 > The created strings would of course be in Unicode and would therefore
 > be very efficient to manipulate.
 >
 > > We will post our performance results as soon as we are done.  I think these
 > > optimizations will have a fairly significant impact on scripts that use the
 > > string operations heavily.
 > >
 >
 > As an aside, I know that one reason why UTF-8 was added was to enable
 > binary data to be used more easily throughout Tcl / Tk and extensions,
 > so removing it would involve a lot of work to add binary data support
 > back in. However, I would like to know whether you think that this is a
 > possibility, as there are many people out there who do not need UTF-8
 > support and cannot afford to pay the overhead it requires.
 >

 Also, as this would triple the size of an ASCII string compared with
 the same string in 8.0, I think that a special check should be made to
 see whether the string is pure ASCII; if it is, then no Unicode needs to
 be generated and the string rep can be worked on as it is now.
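
 A minimal sketch of the pure-ASCII check suggested above, written in
 plain C and assuming nothing about the real Tcl sources. The names
 IsPureAscii and AsciiCharAt are hypothetical and are not part of the
 Tcl C API. The point is only that when every byte is below 0x80, byte
 offsets and character offsets coincide, so length and index can be
 answered straight from the existing string rep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl API): returns 1 if every byte in
 * bytes[0..length) is 7-bit ASCII, otherwise 0. */
static int IsPureAscii(const char *bytes, size_t length)
{
    size_t i;

    assert(bytes != NULL);
    for (i = 0; i < length; i++) {
        if ((unsigned char) bytes[i] >= 0x80) {
            return 0;   /* lead or trail byte of a multi-byte sequence */
        }
    }
    return 1;
}

/* Hypothetical helper: for a pure-ASCII string, character number
 * 'index' is simply byte number 'index'.  Returns -1 when the index is
 * out of range or the string is not pure ASCII, in which case a caller
 * would fall back to the slower UTF-8 or Unicode path. */
static int AsciiCharAt(const char *bytes, size_t length, size_t index)
{
    if (index >= length || !IsPureAscii(bytes, length)) {
        return -1;
    }
    return (unsigned char) bytes[index];
}
```

 This sketch rescans the string on every call; a real implementation
 would presumably cache the result of the check alongside the object
 so the scan is paid for only once.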

 Also the string command should be made ByteArray aware so it can work
 on the underlying binary data efficiently, rather than trawling
 through the UTF-8 representation.

 All this would be much easier if Tcl_ObjTypes supported (and used)
 interfaces, as a "string" interface could then be added to the
 "ByteArray" type.
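
 To make the "string" interface idea concrete, here is a rough sketch
 in plain C. Everything in it is an assumption for illustration:
 StringInterface, SketchObjType, ByteArrayRep and SketchStringLength
 are invented names, and SketchObjType is only a simplified stand-in
 for the real Tcl_ObjType, which has no such interface field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "string" interface: operations a type can perform
 * directly on its own internal representation. */
typedef struct StringInterface {
    size_t (*length)(const void *rep);
    int (*index)(const void *rep, size_t i);    /* -1 if out of range */
} StringInterface;

/* Simplified stand-in for an object type that advertises interfaces. */
typedef struct SketchObjType {
    const char *name;
    const StringInterface *stringIf;    /* NULL if not supported */
} SketchObjType;

/* Hypothetical internal rep of a byte array: a counted byte buffer. */
typedef struct ByteArrayRep {
    size_t used;
    const unsigned char *bytes;
} ByteArrayRep;

static size_t ByteArrayLength(const void *rep)
{
    return ((const ByteArrayRep *) rep)->used;
}

static int ByteArrayIndex(const void *rep, size_t i)
{
    const ByteArrayRep *ba = (const ByteArrayRep *) rep;

    return (i < ba->used) ? (int) ba->bytes[i] : -1;
}

static const StringInterface byteArrayStringIf = {
    ByteArrayLength, ByteArrayIndex
};
static const SketchObjType byteArrayType = {
    "bytearray", &byteArrayStringIf
};

/* What a [string length] implementation might do: ask the type's own
 * string interface if it has one.  Returns -1 to tell the caller to
 * fall back to the generic string rep instead. */
static long SketchStringLength(const SketchObjType *type, const void *rep)
{
    assert(rep != NULL);
    if (type != NULL && type->stringIf != NULL) {
        return (long) type->stringIf->length(rep);
    }
    return -1;
}
```

 With something along these lines, the string command would not need
 to know about ByteArray specifically; any type that filled in the
 interface could be operated on without losing its internal rep.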

 --
 Paul Duffin
 DT/6000 Development    Email: pduffin@hursley.ibm.com
 IBM UK Laboratories Ltd., Hursley Park nr. Winchester
 Internal: 7-246880    International: +44 1962-816880

Last modified
1999-09-27
