Typed CSV Specification
Rationale
The purpose of the Typed CSV specification is to build on the common CSV (comma-separated values) format with a standard, unambiguous format.
The key issues with most csv files at present are:
- Character encodings are not defined
- The data has no type attributes, so `1` could be considered an integer, a floating point number, or a string. File loaders often have to guess the type, and modelling software needs to be written to explicitly cast types.
Example
An example of typed csv:
    # comment lines
    @ author: name@domain.com
    @ write_date: 2020_03_50
    !,time,score,word,is_first,price,start_date,start_time
    ?,int,float,str,bool,dec,yyyy_mm_dd,hh_mm_ss
    *,1,1.23,hello,Y,2.52,2020_03_28,14_20_40
Key Rules
- Typed CSV files are always encoded as UTF-8.
- The header row, the types row and every data row must have the same number of fields.
- The first character in the line determines the purpose of the line.
- The first character must be followed by a comma if it is `!`, `?` or `*`.
- The separator defaults to comma `,`.
- `@` can have a space between it and the metadata key.
- All metadata must be above the header row
- Rows must be in the following order: meta > header > types > data
- Comments can be placed anywhere and will be ignored
- Rows must end with a new line character `\n`
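The ordering and field-count rules above can be checked mechanically. The following is a minimal sketch rather than a reference implementation: the function name `check_structure` is hypothetical, and skipping blank lines is an assumption, since the specification does not say how they should be treated.

```python
def check_structure(lines, separator=","):
    """Return a list of problems found; an empty list means the checks passed."""
    # Required order of row kinds: meta > header > types > data.
    order = {"@": 0, "!": 1, "?": 2, "*": 3}
    problems = []
    stage = 0      # highest row kind seen so far
    width = None   # number of fields in the first header/types/data row
    for number, line in enumerate(lines, start=1):
        if not line or line[0] == "#":
            continue  # comments may appear anywhere (blank lines: assumption)
        marker = line[0]
        if marker not in order:
            problems.append(f"line {number}: unknown marker {marker!r}")
            continue
        if order[marker] < stage:
            problems.append(f"line {number}: {marker!r} row is out of order")
        stage = max(stage, order[marker])
        if marker == "@":
            continue  # metadata rows are not split into fields
        fields = line.split(separator)[1:]  # drop the marker itself
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            problems.append(
                f"line {number}: expected {width} fields, found {len(fields)}"
            )
    return problems
```

Returning a list of problems instead of raising on the first one is a design choice made here so that a caller can report every issue in a file at once.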
Character | Purpose | Notes |
---|---|---|
# | comment | ignored |
@ | metadata | for storing individual values, key and value are separated by a colon : |
! | header | names of the columns in the data |
? | data types | the type of data in the column |
* | data row | a row of data values |
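As an illustration of how the first character drives processing, here is a small sketch that sorts the lines of a document into their purposes. The function name `split_document` and the shape of the returned dictionary are assumptions made for this example, and field values are left as unconverted strings.

```python
def split_document(text, separator=","):
    """Group the lines of a typed CSV document by their leading marker."""
    doc = {"comments": [], "metadata": [], "header": None, "types": None, "rows": []}
    for line in text.splitlines():
        if not line:
            continue  # skipping blank lines is an assumption of this sketch
        marker, rest = line[0], line[1:]
        if marker == "#":
            doc["comments"].append(rest)
        elif marker == "@":
            doc["metadata"].append(rest)
        elif marker in ("!", "?", "*"):
            # The marker is followed by a separator, so the first piece is empty.
            fields = rest.split(separator)[1:]
            if marker == "!":
                doc["header"] = fields
            elif marker == "?":
                doc["types"] = fields
            else:
                doc["rows"].append(fields)
        else:
            raise ValueError(f"unknown line marker: {marker!r}")
    return doc


example = (
    "# comment lines\n"
    "@ author: name@domain.com\n"
    "!,time,score,word,is_first,price,start_date,start_time\n"
    "?,int,float,str,bool,dec,yyyy_mm_dd,hh_mm_ss\n"
    "*,1,1.23,hello,Y,2.52,2020_03_28,14_20_40\n"
)
document = split_document(example)
```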
Data Types
- `int`: Integer (..., -3, -2, -1, 0, 1, 2, 3, ...)
- `float`: Floating point number (13523.524); only decimal notation is supported
- `str`: String/text
- `bool`: Boolean, using the following (case insensitive):
  - `T`, `1`, `Y`, `true` evaluate to True
  - `F`, `0`, `N`, `false` evaluate to False
- `dec`: Decimal, for dealing with currency
- `yyyy_mm_dd`: date
- `hh_mm_ss`: time
- `u_`: a user defined type
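The boolean, date and time types can be converted with a few lines of code. This sketch uses Python's standard `datetime` module; the function names are hypothetical, and raising `ValueError` for unrecognised input is an assumption, since the specification does not describe error handling.

```python
from datetime import date, datetime, time

TRUE_VALUES = {"t", "1", "y", "true"}
FALSE_VALUES = {"f", "0", "n", "false"}


def parse_bool(text: str) -> bool:
    """Convert a bool field; matching is case insensitive."""
    value = text.lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a valid bool value: {text!r}")


def parse_date(text: str) -> date:
    """Convert a yyyy_mm_dd field, for example 2020_03_28."""
    return datetime.strptime(text, "%Y_%m_%d").date()


def parse_time(text: str) -> time:
    """Convert an hh_mm_ss field, for example 14_20_40."""
    return datetime.strptime(text, "%H_%M_%S").time()
```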
Data for number types (`int`, `float`, `dec`) can optionally have underscore characters (`_`) as thousands separators. These will be ignored on processing.
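One way to honour the underscore rule is to strip the underscores before converting. The helper below is a sketch with a hypothetical name, `parse_number`. It does not enforce the "decimal notation only" rule for `float`, so a stricter loader would need an extra check.

```python
from decimal import Decimal


def parse_number(text: str, type_name: str):
    """Convert an int, float or dec field, ignoring underscore separators."""
    cleaned = text.replace("_", "")
    if type_name == "int":
        return int(cleaned)
    if type_name == "float":
        # Note: float() also accepts exponent notation such as 1e5,
        # which this sketch does not reject.
        return float(cleaned)
    if type_name == "dec":
        return Decimal(cleaned)
    raise ValueError(f"not a number type: {type_name!r}")
```

Using `Decimal` for `dec` keeps values such as `2.52` exact, which matches the stated purpose of the type (currency).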
Metadata
A single space can be added between the `@` and the key, so the following two lines are valid metadata and mean the same:

    @key:value
    @ key:value

Any trailing whitespace will be considered part of the key. Any characters after the colon (`:`) will be considered part of the value, up until the new line. The specification does not define a type for the value.
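The metadata rules can be sketched as a small parsing function. The name `parse_metadata` is hypothetical, and splitting at the first colon is an assumption: it means a key cannot contain a colon, while a value can.

```python
def parse_metadata(line: str):
    """Split a metadata line into a (key, value) pair of strings."""
    if not line.startswith("@"):
        raise ValueError("metadata lines must start with '@'")
    body = line[1:]
    if body.startswith(" "):
        body = body[1:]  # drop the single optional space after '@'
    key, colon, value = body.partition(":")  # split at the first colon (assumption)
    if not colon:
        raise ValueError("metadata lines need a colon between key and value")
    # Whitespace is preserved: trailing spaces stay in the key,
    # and everything after the colon stays in the value.
    return key, value
```

With this sketch, `@key:value` and `@ key:value` both give `("key", "value")`, while `@ author: name@domain.com` gives a value with a leading space, because everything after the colon belongs to the value.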
Reserved keys
Reserved keys are optional, but can enhance the stability of the data
- `@length`: The number of rows of data, as an integer. If this is supplied and the value does not match the number of data rows, an error will occur.
- `@separator`: The separator character(s).
- `@md5-checksum`: A 128-bit MD5 checksum, presented as 32 hexadecimal digits (0-9a-f). This hash is based on a string containing the header, types and data in the order they appear. Metadata and comments are ignored. See https://en.wikipedia.org/wiki/MD5 for details of MD5.
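The `@length` and `@md5-checksum` keys could be verified along the following lines. This is a sketch under stated assumptions: the specification does not say exactly how the hashed string is built, so the code assumes the complete header, types and data lines (including their leading marker), each terminated by `\n` and encoded as UTF-8. The function names are hypothetical.

```python
import hashlib


def checksum(lines):
    """MD5 over header, types and data lines; metadata and comments are skipped."""
    # Assumption: each kept line is hashed in full, followed by "\n".
    payload = "".join(line + "\n" for line in lines if line[:1] in ("!", "?", "*"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def count_rows(lines):
    """Number of data rows, for comparison with the @length value."""
    return sum(1 for line in lines if line.startswith("*"))


def check_length(lines, declared):
    """Raise ValueError if the declared @length does not match the data."""
    actual = count_rows(lines)
    if int(declared) != actual:
        raise ValueError(f"@length is {declared}, but found {actual} data rows")
```

Because comments and metadata are excluded from the hash, adding or removing them does not change the checksum; changing any header, type or data line does.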
Custom Separator
If the data is likely to contain commas, a custom separator can be specified by the `@separator` metadata item. The separator can consist of one or more characters.

    @separator:^|^
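A reader might pick up the separator from the metadata before splitting any rows. The sketch below uses hypothetical helper names and assumes that, when a custom separator is set, it also replaces the comma that follows the `!`, `?` or `*` marker; the specification does not state this explicitly.

```python
def find_separator(lines, default=","):
    """Return the @separator value if present, otherwise the default comma."""
    for line in lines:
        if not line.startswith("@"):
            continue
        body = line[1:]
        if body.startswith(" "):
            body = body[1:]  # single optional space after '@'
        key, _, value = body.partition(":")
        if key == "separator":
            return value
    return default


def split_row(line, separator):
    """Split a header, types or data line into its fields."""
    rest = line[1:]  # drop the marker character
    if not rest.startswith(separator):
        # Assumption: the marker is followed by the active separator.
        raise ValueError("marker must be followed by the separator")
    return rest[len(separator):].split(separator)


lines = [
    "@separator:^|^",
    "!^|^name^|^note",
    "?^|^str^|^str",
    "*^|^Ann^|^likes a, b and c",
]
separator = find_separator(lines)
fields = split_row(lines[3], separator)
```

Here the commas inside the last field survive untouched, because only `^|^` is treated as a field boundary.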
Application specific types
A type beginning with `u_` is left to the application to process. An example would be `u_yyyy_mm`, which would store the year and month. Caution should be taken to avoid name conflicts.
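An application could keep its own types in a small registry. This is a sketch of one possible design, not part of the specification: `CUSTOM_PARSERS`, `register_type` and `parse_year_month` are hypothetical names, and refusing duplicate registrations is one way to guard against name conflicts.

```python
CUSTOM_PARSERS = {}


def register_type(name, parser):
    """Register a parser for an application specific type such as u_yyyy_mm."""
    if not name.startswith("u_"):
        raise ValueError("application specific types must begin with 'u_'")
    if name in CUSTOM_PARSERS:
        raise ValueError(f"type {name!r} is already registered")
    CUSTOM_PARSERS[name] = parser


def parse_year_month(text):
    """Convert a u_yyyy_mm field such as 2020_03 into a (year, month) tuple."""
    year, month = text.split("_")
    return int(year), int(month)


register_type("u_yyyy_mm", parse_year_month)
```

A loader would then look up the column type in `CUSTOM_PARSERS` whenever it begins with `u_`, and fall back to an error if the application has not registered it.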