Typed CSV Specification
Rationale
The purpose of the typed CSV specification is to build on the common CSV (comma-separated values) format to give a standard, unambiguous format.
The key issues with most csv files at present are:
- Character encodings are not defined
- The data has no type attributes, so `1` could be considered an integer, a floating point number, or a string. File loaders often have to guess the type, and modelling software needs to be written to explicitly cast types.
Example
An example of typed csv:
    # comment lines
    @ author: name@domain.com
    @ write_date: 2020_03_50
    !,time,score,word,is_first,price,start_date,start_time
    ?,int,float,str,bool,dec,yyyy_mm_dd,hh_mm_ss
    *,1,1.23,hello,Y,2.52,2020_03_28,14_20_40
Key Rules
- Typed CSV files are always encoded as UTF-8.
- All header names, header types and data rows must have the same length.
- The first character in the line determines the purpose of the line.
- The first character must be followed by a comma if it is `!`, `?` or `*`.
- The separator defaults to comma `,`.
- `@` can have a space between it and the metadata key.
- All metadata must be above the header row.
- Rows must be in the following order: meta > header > types > data
- Comments can be placed anywhere and will be ignored
- Rows must end with a new line character `\n`.
| Character | Purpose | Notes |
|---|---|---|
| # | comment | ignored |
| @ | metadata | for storing individual values, key and value are separated by a colon : |
| ! | header | names of the columns in the data |
| ? | data types | the type of data in the column |
| * | data row | a row of data values |
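The first-character rule can be sketched in a few lines of Python. This is only an illustration: the names `LINE_KINDS`, `classify_line` and `split_row` are hypothetical and not part of the specification, and the sketch assumes no quoting or escaping of the separator, since none is described here.

```python
# Map each leading character to its purpose, following the table above.
LINE_KINDS = {
    "#": "comment",
    "@": "metadata",
    "!": "header",
    "?": "types",
    "*": "data",
}


def classify_line(line: str) -> str:
    """Return the purpose of a typed CSV line based on its first character."""
    if not line:
        raise ValueError("empty line")
    try:
        return LINE_KINDS[line[0]]
    except KeyError:
        raise ValueError(f"unknown line marker: {line[0]!r}") from None


def split_row(line: str, separator: str = ",") -> list[str]:
    """Split a header, types or data line into fields, dropping the marker."""
    marker, _, rest = line.rstrip("\n").partition(separator)
    # The marker must be immediately followed by the separator.
    if marker not in ("!", "?", "*"):
        raise ValueError("expected '!', '?' or '*' followed by the separator")
    return rest.split(separator)
```

A loader could call `classify_line` on every line and pass header, types and data lines to `split_row`, then check that all three kinds of row have the same number of fields.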
Data Types
- int: Integer (..., -3, -2, -1, 0, 1, 2, 3, ...)
- float: Floating Point Number (13523.524), only decimal notation is supported
- str: String/Text
- bool: Boolean, using the following (case insensitive)
  - T, 1, Y, true evaluate to True
  - F, 0, N, false evaluate to False
- dec: Decimal for dealing with currency
- yyyy_mm_dd: date
- hh_mm_ss: time
- u_: a user defined type
Data for number types (int, float, dec) can optionally have underscore characters (_) as thousand separators. These will be ignored on processing.
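As a rough illustration of how a loader might turn field text into values, the sketch below maps each built-in type name to a Python converter. The mapping to `int`, `float`, `decimal.Decimal`, `datetime.date` and `datetime.time` is an assumption of this example, not something the specification mandates, and the names `CONVERTERS` and `convert` are hypothetical.

```python
from datetime import date, time
from decimal import Decimal

TRUE_VALUES = {"t", "1", "y", "true"}
FALSE_VALUES = {"f", "0", "n", "false"}


def parse_bool(text: str) -> bool:
    """Case-insensitive boolean parsing using the values listed above."""
    lowered = text.lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"not a typed CSV boolean: {text!r}")


def parse_date(text: str) -> date:
    year, month, day = (int(part) for part in text.split("_"))
    return date(year, month, day)


def parse_time(text: str) -> time:
    hour, minute, second = (int(part) for part in text.split("_"))
    return time(hour, minute, second)


# Underscores in number types are thousand separators and are dropped.
# Note: Python's float() also accepts exponent notation, which this
# sketch does not reject even though only decimal notation is supported.
CONVERTERS = {
    "int": lambda text: int(text.replace("_", "")),
    "float": lambda text: float(text.replace("_", "")),
    "dec": lambda text: Decimal(text.replace("_", "")),
    "str": str,
    "bool": parse_bool,
    "yyyy_mm_dd": parse_date,
    "hh_mm_ss": parse_time,
}


def convert(type_name: str, text: str):
    """Convert one field using the converter registered for its type."""
    return CONVERTERS[type_name](text)
```

For example, `convert("int", "1_000_000")` yields the integer one million, and `convert("bool", "Y")` yields `True`.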
Metadata
A single whitespace character can be added after the @. Thus the following are valid metadata lines and mean the same:

    @key:value
    @ key:value

Any trailing whitespace will be considered part of the key.
Any characters after the colon (:) will be considered part of the value, up until the new line.
The type of the value is not documented.
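A minimal sketch of these metadata rules in Python follows. The function name `parse_metadata` is hypothetical, and the sketch assumes the key ends at the first colon, so any later colons belong to the value.

```python
def parse_metadata(line: str) -> tuple[str, str]:
    """Split a metadata line into (key, value)."""
    if not line.startswith("@"):
        raise ValueError("metadata lines start with '@'")
    body = line[1:].rstrip("\n")
    # A single optional space is allowed between '@' and the key.
    if body.startswith(" "):
        body = body[1:]
    # Everything after the first colon is the value, kept verbatim.
    key, colon, value = body.partition(":")
    if not colon:
        raise ValueError("metadata needs a colon between key and value")
    return key, value
```

Because the value is taken verbatim, a line such as `@ author: name@domain.com` produces a value with a leading space; whether to strip it is left to the application.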
Reserved keys
Reserved keys are optional, but can enhance the stability of the data:
- @length: The number of rows of data, as an integer. If this is supplied and the value does not match the number of data rows, an error will occur.
- @separator: The separator character(s).
- @md5-checksum: A 128-bit MD5 checksum, presented as 32 hexadecimal digits (0-9a-f). This hash is based on a string containing the header, types and data in the order they appear. Metadata and comments are ignored. See https://en.wikipedia.org/wiki/MD5 for details of MD5.
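The length and checksum checks might look like the sketch below. The text above does not say exactly how the hashed string is assembled, so this example assumes that header, types and data lines are concatenated as written (marker, separators and trailing `\n` included) and encoded as UTF-8. The function names are hypothetical.

```python
import hashlib


def md5_checksum(lines: list[str]) -> str:
    """Hash header, types and data lines in order; skip metadata and comments."""
    # Assumption: each kept line is hashed exactly as it appears in the file.
    kept = [line for line in lines if line[:1] in ("!", "?", "*")]
    return hashlib.md5("".join(kept).encode("utf-8")).hexdigest()


def validate(lines: list[str], metadata: dict[str, str]) -> None:
    """Raise ValueError if a supplied reserved key disagrees with the data."""
    data_rows = sum(1 for line in lines if line.startswith("*"))
    if "length" in metadata and int(metadata["length"]) != data_rows:
        raise ValueError(
            f"@length is {metadata['length']!r} but found {data_rows} data rows"
        )
    if "md5-checksum" in metadata:
        expected = metadata["md5-checksum"].strip().lower()
        if expected != md5_checksum(lines):
            raise ValueError("@md5-checksum does not match the file contents")
```

A writer and a reader would need to agree on the same assumption about which characters are hashed for the checksum to be portable between implementations.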
Custom Separator
If the data is likely to contain commas, a custom separator can be specified with the @separator metadata item. The separator can consist of one or more characters. For example:

    @separator:^|^
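The following sketch shows one way a reader could honour a custom separator. It relies on the rule that metadata appears above the header row, so the separator is known before any row is split. The function name `read_rows` is hypothetical, and the separator value is assumed to be used verbatim.

```python
def read_rows(text: str) -> list[list[str]]:
    """Return header, types and data rows as lists of fields."""
    separator = ","  # default separator
    rows = []
    for line in text.splitlines():
        if line.startswith("@"):
            body = line[1:]
            if body.startswith(" "):  # one optional space after '@'
                body = body[1:]
            key, _, value = body.partition(":")
            if key == "separator":
                separator = value
        elif line[:1] in ("!", "?", "*"):
            # Drop the leading marker field, keep the remaining fields.
            rows.append(line.split(separator)[1:])
    return rows
```

With `@separator:^|^`, the data line `*^|^a,b^|^c` splits into the two fields `a,b` and `c`, so the comma stays inside the first value.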
Application specific types
A type beginning with u_ is left to the application to process.
An example would be u_yyyy_mm which would store the year and month.
Caution should be taken to avoid name conflicts.
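One possible way for an application to handle its own types is a small registry keyed by type name, sketched below. The registry, the function names, and the choice to represent `u_yyyy_mm` as a `(year, month)` tuple are all assumptions of this example. Refusing duplicate registrations is one simple guard against name conflicts.

```python
USER_TYPES = {}


def register_user_type(name: str, converter) -> None:
    """Register a converter for an application-specific type."""
    if not name.startswith("u_"):
        raise ValueError("user defined types must begin with 'u_'")
    if name in USER_TYPES:
        raise ValueError(f"type name already registered: {name}")
    USER_TYPES[name] = converter


def parse_year_month(text: str) -> tuple[int, int]:
    """Parse a value such as 2020_03 into (year, month)."""
    year, month = text.split("_")
    return int(year), int(month)


register_user_type("u_yyyy_mm", parse_year_month)
```

A loader that meets `u_yyyy_mm` in the types row could then look the converter up in `USER_TYPES` and fall back to treating the field as a string when no converter is registered.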