Last edit: 05-03-17 Graham Wideman
|Access 2000, Unicode and File Size: Some Notes
Article created: 2001-03-01
Unicode is used in Access and Jet as the format for Text fields. This is a concern because Unicode employs two bytes per character, which for western character sets includes the normal ASCII value as the low order byte and zero for the high-order byte. This would make text fields twice as large as in older versions of Jet.Access, for no useful reason for many users.
To ameliorate this, Access/Jet includes a "Unicode compression" feature, which, in theory looks at the high-order bytes of characters in a text field, and if zero stores the string in a format that omits the zeros. Unicode compression is visible as a property of text fields in the Access Table Design window.
Unfortunately, there are a number of end-user and programming situations where Access or Jet defaults to Unicode Compression disabled. One such situation is where Access 97 databases are imported to Access 2000. It is possibly to manually go through each table, set the compression on, and then compact the database -- though it's sometimes not evident whether this has had the desired effect.
In any case, it's necessary to know how and where to set Unicode compression, so that some of these issues can be handled programmatically.
The following is far from complete, just some notes gathered from various sources, and observations.
Q239527 - ACC2000 Database Bloats When Importing Large Text File
Q208285 - ACC2000- Database Size Is Different When Database Is Converted to Access 2000
Q275561 - ACC2000 New Features in Microsoft Jet 4.0 :
Compressible Data Types All string data type fields can be defined to store data in a compressed format. If you have defined a string data type field with this attribute, data will be compressed as it is stored, and uncompressed when retrieved from the field.
This attribute was added for Character fields because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft Jet databases that predominantly contain character data, this could mean that the database file would nearly double in size when converted to the Microsoft Jet 4.0 format. Yet the Unicode representation of many character sets (those formerly denoted as Single-Byte Character Sets, SBCS) can easily be compressed to a single byte. The following is an example of how to define a field that is to be compressed:
CREATE TABLE TableName (FieldName CHARACTER(255) WITH COMPRESSION)
Memo fields can also be defined to store data in a compressed format. There is a limitation, however. Only Memo field values that, when compressed, will fit within 4,096 bytes or fewer, will be compressed. All other Memo field values will remain uncompressed. This means that within a given table, for a given Memo field, some of the data may be compressed, and some may not.
It should be noted that when going through the Access User Interface (UI), Access will always add the Unicode compression attribute whenever applicable. The only time that the end-user needs to be concerned about adding the compression attribute is when creating a table with the CREATE TABLE SQL syntax. The compression attribute is not accessible via DAO when going through the object model to create a string data type.
Note from me: Indeed there appears to be no direct way to get at the field-level attribute in the DAO object model (eg: it's not part of Field.Attributes). It's conceivable that it can be reached as a generic Field.Properties, I haven't tried that.
Intermediate Microsoft Jet SQL for Access 2000:
With the Microsoft Jet 4.0 database engine, all data for the TEXT data types are now stored in the Unicode 2-byte character representation format. It replaces the Multi-byte Character Set (MBCS) format that was used in previous versions. Although Unicode representation requires more space to store each character, columns with TEXT data types can be defined to automatically compress the data if it is possible to do so.
When you create TEXT data types with SQL, the Unicode compression property defaults to No. To set the Unicode compression property to Yes, use the WITH COMPRESSION (or WITH COMP) keywords at the field-level declaration.
The following CREATE TABLE statement creates a new customers table, this time setting the Unicode compression properties to Yes.
CREATE TABLE tblCompressedCustomers ( CustomerID INTEGER CONSTRAINT PK_tblCompCustomers PRIMARY KEY, [Last Name] TEXT(50) WITH COMP NOT NULL, [First Name] TEXT(50) WITH COMPRESSION NOT NULL, Phone TEXT(10), Email TEXT(50), Address TEXT(40) DEFAULT Unknown)
Note that the WITH COMPRESSION and WITH COMP keywords are declared before the NOT NULL keywords. You can also change an existing field's Unicode compression property with an ALTER TABLE statement, like this:
ALTER TABLE tblCustomers ALTER COLUMN [Last Name] TEXT(50) WITH COMPRESSION
Note The WITH COMPRESSION and WITH COMP keywords listed in the previous SQL statements can be executed only through the Jet OLE DB provider and ADO. They will result in an error message if used through the Access SQL View user interface.
Notes from me:
1. I tried issuing the above SQL ALTER statement through DAO's Database.Execute method, and this returns an error. This is somewhat surprising if we know the underlying engine knows what to do with it!
2. Using ADO I issued the above SQL ALTER and it ran smoothly. However, when I inspected the results in Access, only fields longer than about 3 or 4 characters had their Unicode Compression settings set to Yes. This makes some sense, though it's inconsistent with the ability to set this to true in Access UI, regardless of length.
When converting files from Access 97 to 2000, the Unicode Compression setting is of particular concern when the tables are large (multi-hundred meg)... as any changes to design, including compaction, take a long time. Hence it would be nice to get it right from the start.
1. Whether importing tables with or without data, the import wizard (?!) consistently sets Unicode Compression OFF.
2. You might import only the structure, with the intent of manually (or programmatically) fixing all the Unicode Compression settings, and later importing the data with queries. However, if you do not compact the database prior to running those queries, the Unicode Compression-ON setting will not result in a smaller database. (It seems as though the record layout is already established with UC-OFF and not changed unless compacted.).
Keywords: Unicode compression, A2K, MSAccess 2000, Access 2000, Jet, filesize, file size, compact, convert.