Specification of the Cross-Referenced Tables (XRT) Model

Cross-Referenced Tables are just tab-delimited flat files,
which encapsulate data in an object hierarchy with arbitrary attributes and relationships.

1. General Description

XRT organizes data into classes (e.g. gene, transcript, clone, mutation, disease); each class has attributes that describe the properties of its elements. The XRT file Gene.xrt below contains four elements, each with six attributes; the first line specifies the attribute names, and the following lines contain the attribute values of the elements.

Three attributes (ID, P_ID, C_ID) have special meaning to the system. The ID field, which is mandatory for every element, stores the unique identifier of that data element; P_ID and C_ID are used for the more complex parent / child relationships between data elements.

Example1: XRT file Gene.xrt keeps the elements of Gene class

ID	Type	Symbol	OldID	LocusLinkID	C_ID/LocusLink
GA0001	Known_Gene	ABP1	HSC7000001	26	LL000026
GA0002	Known_Gene	ACHE	HSC7000002	43	LL000043
GA0003	Known_Gene	ACTB	HSC7000003	60	LL000060
GA0005	Known_Gene	ADCY1	HSC7000005	107	LL000107
......

Example2: XRT file Transcript.xrt keeps the elements of Transcript class

ID	Source sequence	P_ID/Gene
T00001	AK092514	GA0001
T02767	BX648159	GA0001
T00002	NM_015831	GA0002
T02768	BC001541	GA0002
T00003	NM_000665	GA0002
T00004	AF334270	GA0002
T00005	NM_001101	GA0003
T00006	L05500&;AF497515	GA0005
......

2. Class, Attribute and Unique Identifier

Normally, one class is represented by one XRT file containing all of its attributes. This is not mandatory, however: when data are drawn from multiple sources, one class can have multiple XRT files, each holding a subset of the attributes of that class.

The class name is given by the name of the XRT file: the string before the first dot (.) of the file name is the class name. For example, the XRT files Gene.xrt and Gene.OMIM.xrt both belong to the Gene class.

Gene.OMIM.xrt

ID	OMIM_ID	C_ID/OMIM
GA0001	104610	OMIM_104610
GA0002	100740	OMIM_100740
GA0003	102630	OMIM_102630
......
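
The naming rule can be sketched in a few lines of Python. This is only an illustration, not part of the XRT specification: the function name xrt_class_name and the data/ directory in the usage line are hypothetical.

```python
from pathlib import Path


def xrt_class_name(path: str) -> str:
    """Return the class name: the part of the file name before the first dot."""
    # Path(...).name drops any directory part, so only the file name is inspected.
    return Path(path).name.split(".", 1)[0]


# Both files map to the same class, Gene.
names = [xrt_class_name(p) for p in ("Gene.xrt", "data/Gene.OMIM.xrt")]
```

A loader could use this name as a key to group all XRT files that contribute attributes to the same class.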

The first line of an XRT file must list the attributes of its class; attribute names are delimited by tabs, and the first attribute must be ID. The following lines contain the attribute values for the elements of the class, one element per line. Every element (or data entry) must have a globally unique identifier, kept in the ID attribute, which is mandatory in every XRT table. For example, GA0001 is the unique identifier for gene ABP1, and T02767 is the unique identifier for a transcript.

An attribute can have a single value, no value (empty), or multiple values. Multiple values are separated by '&;' (initially, '\t' was used).
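
The rules in this section (header line first, ID as the first attribute, one element per line, '&;' between multiple values) can be read as a small parser. The Python sketch below is one possible reading, not a reference implementation: the name parse_xrt, the choice to represent every attribute as a list of values, and the rejection of a repeated ID within one file are all assumptions made for illustration.

```python
MULTI_VALUE_SEP = "&;"


def parse_xrt(text):
    """Parse XRT text into {element ID: {attribute name: list of values}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] != "ID":
        raise ValueError("the first attribute of an XRT file must be ID")
    elements = {}
    for line in lines[1:]:
        fields = line.split("\t")
        # Pad short rows so that missing trailing attributes count as "no value".
        fields += [""] * (len(header) - len(fields))
        element_id = fields[0]
        if not element_id:
            raise ValueError("every element needs an ID")
        if element_id in elements:
            # Assumption: a repeated ID inside one file is treated as an error.
            raise ValueError(f"duplicate ID: {element_id}")
        elements[element_id] = {
            # Empty field -> no value; otherwise split on the multi-value separator.
            name: value.split(MULTI_VALUE_SEP) if value else []
            for name, value in zip(header[1:], fields[1:])
        }
    return elements


# Two rows taken from the Transcript example.
sample = (
    "ID\tSource sequence\tP_ID/Gene\n"
    "T00005\tNM_001101\tGA0003\n"
    "T00006\tL05500&;AF497515\tGA0005\n"
)
transcripts = parse_xrt(sample)
```

With this representation, a single-valued attribute is a one-item list and an empty attribute is an empty list, so callers do not need to distinguish the three cases.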

3. Relationship between data elements

Elements in different classes may be related to each other; for example, one gene has one or more transcripts. In the XRT files above, gene GA0001 has two transcripts (T00001 and T02767) and gene GA0003 has one (T00005).

Relationships are kept in the P_ID and C_ID attributes (which stand for parent element ID and child element ID respectively). When you specify the related element of the current element, you must also indicate the class of the related element, as in P_ID/Gene and C_ID/LocusLink, meaning that the parent class is Gene and the child class is LocusLink.
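
As an illustration of the attribute naming convention, the Python sketch below splits a relationship attribute name into its direction and the class of the related element. The function name parse_relation and the "parent"/"child" labels it returns are hypothetical, chosen only for this example.

```python
def parse_relation(attribute):
    """Split 'P_ID/Gene' into ('parent', 'Gene'); return None for ordinary attributes."""
    kind, sep, related_class = attribute.partition("/")
    if sep and kind in ("P_ID", "C_ID") and related_class:
        return ("parent" if kind == "P_ID" else "child", related_class)
    return None


relations = [parse_relation(a) for a in ("P_ID/Gene", "C_ID/LocusLink", "Symbol")]
```

Ordinary attributes such as Symbol yield None, so the same function can be applied to every name in a header line.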

A relationship can be one to one, one to many, or many to many; i.e. one element can have one or more related elements, and multiple elements can share one related element. This flexibility makes XRT well suited to modeling complex biological data.
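
To make the one-to-many case concrete, the Python sketch below groups transcripts by their parent gene, using rows from the Transcript example. The helper name children_by_parent is hypothetical, and the handling of a multi-valued P_ID field (one child with several parents, separated by '&;') is an assumption that combines the multi-value rule with the many-to-many case; the source does not show such a row.

```python
from collections import defaultdict

# (transcript ID, P_ID/Gene value) pairs taken from the Transcript example.
rows = [
    ("T00001", "GA0001"),
    ("T02767", "GA0001"),
    ("T00005", "GA0003"),
    ("T00006", "GA0005"),
]


def children_by_parent(rows, sep="&;"):
    """Build {parent ID: [child IDs]} from (child ID, parent field) pairs."""
    index = defaultdict(list)
    for child_id, parent_field in rows:
        # Assumption: several parents in one field are separated by '&;'.
        for parent_id in filter(None, parent_field.split(sep)):
            index[parent_id].append(child_id)
    return dict(index)


transcripts_of = children_by_parent(rows)
```

Here transcripts_of["GA0001"] lists both transcripts of gene GA0001, which matches the relationship described in the text.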

Below is an example showing three related classes. The relationships can be summarized as: one Locus has one or more Refseq, and one Refseq has one or more ProtDomain. Note that the Refseq class is represented by two XRT files (Refseq.xrt and Refseq.protein.xrt).