Typed CSV Specification
The purpose of the typed csv specification is to build on the common csv (comma separated value) specification with a standard unambigious format.
The key issues with most csv files at present are:
- Character encodings are not defined
- The data has no type attributes, so
1could be considered an integer, floating point number, or a string. File loaders often have to guess the type, and modelling software needs to be written to explicitly cast types.
An example of typed csv:
# comment lines @ author: email@example.com @ write_date: 2020_03_50 !,time,score,word,is_first,price,start_date,start_time ?,int,float,str,bool,dec,yyyy_mm_dd,hh_mm_ss *,1,1.23,hello,Y,2.52,2020_03_28,14_20_40
- Typed CSV files are always encoded as UTF-8.
- All header names, header types and data rows must have the same length.
- The first character in the line determines the purpose of the line.
- The first character must be followed by a comma if it is
- the separator defaults to comma
@can have a space between it and the metadata key.
- All meta data must be above the header row
- Rows must be in the following order: meta > header > types > data
- Comments can be placed anywhere and will be ignored
- Rows must end with a new line character
|@||metadata||for storing individual values, key and value are separated by a colon
|!||header||names of the columns in the data|
|?||data types||the type of data in the column|
|*||data row||a row of data values|
int: Integer (..., -3, -2, -1, 0, 1, 2, 3, ...)
float: Floating Point Number (13523.524), only decimal notation is supported
bool: Boolean, using the following (case insensitive)
trueevaluate to True
false, evaluate to False
dec: Decimal for dealing with currency
u_: a user defined type
Data for number types (
dec) can optionally have underscore characters (
_) as thousand separators. These will be ignored on processing.
A single whitespace
can be added before the
@. Thus the following are valid metadata and mean the same. Any trailing whitespace will be considered part of the key.
Any characters after the colon (
:) will be considered part of the value, up until the new line.
The type of the value is not documented.
@key:value @ key:value
Reserved keys are optional, but can enhance the stability of the data
@length: The number of rows of data, as an integer, if this is supplied and the values does not match the number of data rows, an error will occur.
@separator: The separator character(s)
@md5-checksum: An 128bit MD5 checksum, presented as 32 hexadecimal digits (0-9a-f), this hash is based on a string containing header, types and data in the order they appear. Metadata and comments are ignored. See https://en.wikipedia.org/wiki/MD5 for details of MD5.
If the data is likely to contain commas, a custom separator can be specified by
@separator metadata item. The separator can consist of one or more characters.
Application specific types
A type beginning with
u_ is left to the application to process.
An example would be
u_yyyy_mm which would store the year and month.
Caution should be taken to avoid name conflicts.