How does the automatic data type conversion work?

If we run a log error, or note, every time we did an automatic data type conversion for you, your log would fill up so quickly, you wouldn’t be able to see anything else. So by default, automatic data type conversions are ignored by DS2, no note or warning is written in the log, processing just continues. But there’ll be times when the automatic conversion causes issues with the way that your program is calculating results. And you’ll be wanting to see that. So you can set this DS2 option TYPECAST right before you run that particular step. And then every time an automatic conversion happens, you do get a warning in a log, and that can help you narrow down where the problem is occurring. Now, we’ll move into the program blocks themselves. And in the program block, we have that global declaration space that goes right before the first method definition starts. These statements are not valid inside the method themselves, and they include RETAIN, DROP, KEEP, RENAME, and ARRAY.

The next thing we’ll talk about is data types. And we said that DS2 could define and manipulate about 17 different ANSI data types, straight up, without any type of data conversion being required from the data source. This is true. And you can create those data types right in a DS2 DATA step also, but where you save the output or the result set, to will control what you can actually write. So for instance, we may be running a DS2 DATA program in a pure SAS environment, and we have decimal data types, and ARCHER, and all those fancy ANSI types as we process them in a program DATA vector. But if we are writing that result set out to a SAS 9 DATA set, then before they can be written to disk, all of those data types will have to be down-converted by the engine to fixed-width character, or double-precision numeric, because that’s all that a SAS 9 DATA set can store.

Now, for example, in Hadoop, if we’re processing Hadoop data, Hive doesn’t have an ANSI time data type. So if you’re storing date, time, or date values, you’ll get ANSI date, and ANSI timestamp values in Hadoop, but if you store a time value, it comes out ARCHER on the Hadoop side. And many databases have differences like this. For example, Oracle stores date, time, and timestamp value, it stores them all as timestamps. So while we can process them, that’s fine until you write them up. And then what can be written out is determined by the target data source that you’re writing to. First, let’s talk about character data types. DS2 can define and manipulate four different character types. This CHAR and CHAR are both fixed-width. And the ARCHER and ARCHER are both variable-length per row.

In all cases, the N indicates the number of characters maximum that can be stored in the column when it’s defined. In CHAR and ARCHER, it’s one bite per character. But for Unicode characters, particularly are used in languages that use one character per word, for example, Chinese, or Japanese, the codes are much more complex, and there are a lot more of them. So generally it’s 2 to 4 bytes per character code. So N still defines the number of characters to store, but it may result in a larger bite on this than you’re used to. However, DS2 can handle CHAR and CHAR in the same session, no matter what session encoding is. Interestingly enough, we’ve just talked about character data, we’ve already talked about more data types than a traditional SAS DATA set could handle. All told. The next data type I’ll like to talk about is floating-point numeric. Now, floating-point numeric data types store assigned but approximate value with a floating-point. These come in basically two flavors, with a FLOAT kind of and in between.

Double precision numeric is the standard SAS numeric value, and in DS2, it’s the standard value for an undeclared variable that you can’t tell what type it is from the context. Double-precision floating-point values are stored in eight bytes. Now, this means that you’re going to get a maximum of about 16 digits of precision. Now, the magnitude you could store is enormous, but the precision will top out about 16 digits. A smaller version of this, the REAL, is only 4 bytes, which tops out at about seven digits of precision. And then there’s FLOAT. Basically, you tell the FLOAT what precision you need, and it’ll store the result in either four or eight bytes, or either DOUBLE or REAL. Now, DS2 has a different type of numeric involved with fractions. This is a DECIMAL data type. This is an exact fractional type.

Whereas the floating-point data types only promise you an approximation of the number that you stored, the decimal data type promises you the exact number that you stored. Of course, under the covers, it’s going to use more bytes than the eight bytes in your double precision numeric, if you’re storing some very high precision data. You can have up to 52 digits of precision overall in a decimal number. When you think about this, this is significantly more precise than you could have got with a floating-point number, only 16 digits. But because you’re limited to 52 digits total, you’re going to see that the magnitude limit on a decimal number is much lower than you can get with a floating point. I like to say that scientists and astronomers love floating-point numerics, because you can have numbers of enormous magnitude stored, and they rarely need more than 16 digits of precision, whereas decimal data types are loved by accountants, who like everything down to the nearest penny, and don’t want any of this approximate stuff. But wait, there’s more. There are integer numeric types.

These store numbers, but have no room to store a fraction. And as a result, they store it in fewer bytes than a similarly sized floating-point or decimal type. These range in size all the way from a TINY INT, which is only one byte, and can only store a number between a negative 128, and a positive 127, all the way up to a BIG INT, which can store up to 19 digits max. INTEGER types always return an exact value if the number stored in them falls within the range that they’re capable of storing. In ANSI systems, we also have DATE, TIME, and TIMESTAMP data types. These should never be confused with SAS dates, times, and date-time values. In SAS, all of those values are stored as double-precision floating-point numerics. And the only thing that distinguishes them from any other number is the format that you apply. So if you have the number 365, that could be my weekly rent, it could be the number of days in a year, it could actually be a date, right? And we wouldn’t know unless the appropriate format was applied to that number.

Leave a Comment