What is metadata?

Metadata means data about data. While providing a summary of data, metadata does not provide the content of data itself. Another way to think about metadata is as a library catalogue: it has information about what is available in the library, but doesn’t include any of the information contained in the books. Like a library catalogue, having access to metadata doesn’t mean you have access to the data, but it allows you to see what’s available and what types of things you might be interested in improving or exploring. Metadata helps with the classification, access, and storage of data.


Why is metadata important?

Metadata helps to show the relationships between different studies that have been conducted over the years. It also helps improve the accessibility and usability of existing data. For example, if a researcher has a new idea about how to solve a problem, or if there is a need for new answers, having access to metadata can help them figure out whether they need to launch a new study or if there is existing data that could be used. Specific to the CTN, the Metadata Project improves collaboration between cohort studies, helps with the secondary use of previously collected data and samples, enhances future data collection, and helps to quickly identify areas, problems, or groups that may not be receiving enough attention. It is important to note that researchers are still required to follow data access and ethical review procedures for each cohort to use the data.


What did we do?

Led by postdoctoral fellow Dr. Adriana Rodriguez Cruz, the CTN Metadata Project team collected the protocols and documentation from 12 cohort studies. The team then extracted all of the metadata from these documents to create a standardized list of variables collected across these studies. The variables were grouped by subject area and other more specific characteristics and placed into a series of tables, ranging from a general overview to granular details about specific types of data and how they were collected.


What did we find?

In total, over 100 documents were reviewed, resulting in 34 tables of variables. The 12 cohort studies collected data covering a range of populations, including women, people living with co-infections, older adults, infants and children, men who have sex with men, and slow progressors. Ten studies had an associated biobank, and eight studies included participant-reported outcome measures, like quality of life. Importantly, even when many studies collected the same variable, the way the data were collected or how the variable was defined was not always consistent.


What did we learn?

The results provide a stark reminder of the work left to be done: for example, while most of the cohorts collect age, education, and ethnicity, few studies include immigration or Indigenous-related information. The results also showed that data collection standards must be regularly appraised to ensure they evolve in keeping with scientific knowledge, social context, and the values of research participants; recent efforts to improve reporting on gender, sex at birth, ethnicity, and other identity variables are one example. Finally, the results have identified new priorities for CTN data management, including the need for a best-practices guide for data dictionaries, scales for future work to improve cross-study comparison, and use of standardized data extraction protocols.

Identifying gaps, promoting collaborations


The CTN Metadata Project illustrates the vast scope of CTN-supported cohort studies over the years, allows for the identification of under-researched areas, and, most of all, promotes new and timely collaborations between researchers. Creating knowledge and maximizing its use supports a core pillar in the effort to eliminate STBBIs as public health concerns.


Why are cohorts important?

From its inception, the CTN has adapted to the shifting priorities of the HIV pandemic, the geographic and epidemiological trends of STBBIs across Canada, and both provincial and federal policy. This adaptability is demonstrated by the breadth, duration, and timeliness of the CTN-supported cohort studies with respect to relevant historical milestones. In addition to adapting its scientific priorities, the CTN has also broadened its scope beyond clinical trials to include basic, social, and behavioural science approaches; Indigenous methodologies; and community-based and implementation research. The first cohort study started in 2002. There are currently nine ongoing studies, each responding to specific aspects of the HIV and STBBI epidemics.




How can the metadata be used?

The findings of the CTN Metadata Project can be used by researchers to understand the possibilities for combining data from different studies to form new cohorts and answer new research questions. Importantly, access to individual patient data is not a part of this project and researchers are required to follow data access and ethical review procedures of each cohort.
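As a rough illustration of how metadata alone can help scope a combined analysis, the sketch below checks which cohorts list every variable a new research question would need. The cohort names, variable lists, and the function `cohorts_with_variables` are hypothetical placeholders, not the project's actual records or tooling.

```python
# Hypothetical sketch: cohort names and variable lists are placeholders,
# not actual CTN metadata.
def cohorts_with_variables(variables_by_cohort, required):
    """Return the cohorts whose metadata lists every required variable."""
    required = set(required)
    return sorted(
        cohort
        for cohort, variables in variables_by_cohort.items()
        if required <= set(variables)  # subset test: all required variables present
    )


variables_by_cohort = {
    "cohort_a": ["age", "education", "ethnicity", "quality of life"],
    "cohort_b": ["age", "ethnicity"],
    "cohort_c": ["age", "education", "quality of life"],
}

candidates = cohorts_with_variables(variables_by_cohort, ["age", "quality of life"])
print(candidates)
```

A match in a check like this only indicates that a variable was recorded. Definitions and collection methods may still differ between cohorts, and each cohort's data access and ethical review procedures still apply.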

The metadata included in this tool were gathered from some of the largest and most extensive HIV cohort studies in Canada, meaning that the results can also be used to understand the geographical and demographic gaps in clinical data.

Data structure

How is the metadata organized?

All metadata records were aggregated into three main categories of heat maps: the master heat map, which includes all the data domains collected by each cohort study; high-level entry tables for each data domain describing the granularity of the data collected; and secondary tables, which provide descriptive information about specific variables and how they were collected.
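One way to picture this three-level layout is as nested lookups: a cohort-by-domain presence map, a per-domain table of variables, and a per-variable description of how each cohort collected it. The sketch below is only an assumed representation; the cohort names, values, and the helper `domain_counts` are hypothetical and do not reflect the project's actual files.

```python
# Hypothetical sketch: all names and values are placeholders, not CTN data.
from collections import Counter

# Level 1: master heat map (which data domains each cohort collected).
master_heat_map = {
    "cohort_a": {"sociodemographics", "medical history", "medications"},
    "cohort_b": {"sociodemographics", "medications"},
    "cohort_c": {"sociodemographics", "infant health"},
}

# Level 2: entry table for one domain (which variables each cohort collected).
entry_tables = {
    "sociodemographics": {
        "cohort_a": {"age", "education", "ethnicity"},
        "cohort_b": {"age", "ethnicity"},
        "cohort_c": {"age", "education", "immigration"},
    },
}

# Level 3: secondary table (how a specific variable was collected).
secondary_tables = {
    ("sociodemographics", "age"): {
        "cohort_a": "date of birth",
        "cohort_b": "age in years at enrolment",
        "cohort_c": "age category",
    },
}


def domain_counts(heat_map):
    """Count how many cohorts collected each data domain."""
    counts = Counter()
    for domains in heat_map.values():
        counts.update(domains)
    return counts


counts = domain_counts(master_heat_map)
print(counts.most_common())
```

Counting across the top level in this way shows which domains are widely collected and which are rare, while the lower levels expose differences in how the same variable was defined.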

Data structure

What type of metadata is available?

The most frequently collected data domains were sociodemographic characteristics, clinical biological and physiological variables, medical history, inclusion & exclusion criteria, and medications. The least frequently collected data domains were women’s reproductive health and information related to infant health.

Diving deeper

Which variables are available?

For the sociodemographic characteristics data domain, the variables collected most frequently across the studies include age, education, and ethnicity. Variables collected less frequently were immigration-related information and spoken languages. This heat map is an example of the diversity of information collected by the CTN-supported cohort studies, and of the areas where data collection could be improved.

Of all the variables collected across the cohorts, 18 have an accompanying secondary table, which provides descriptive information about the variable and how it was collected. For example, in the sociodemographics table below, age, education, ethnicity, and gender (indicated in green) each have a secondary table that expands on the variable.



One of the goals of the Metadata Project was to improve the use of stored biospecimens for secondary research studies or cross-cohort collaborations. Nine cohorts collected samples for biobanking purposes. The most frequently collected biospecimen is peripheral blood mononuclear cells (PBMCs), followed by serum and plasma. Importantly, these samples can only be shared if participants provide this type of consent. Consent forms for biobanks aim to be clear on how the biological specimens are collected, whether secondary use is allowed and for what purpose, and what processes need to take place for the sample to be accessible. Consent is a dynamic process: ongoing discussions of consent with research participants are the best way to protect the privacy of individuals and communities. The consent process should be continually reassessed to ensure it is evolving with community priorities.

Looking forward

What's next?

The CTN Metadata Project is a promising and accessible tool to strengthen future research in HIV, HCV, and other STBBIs, providing multiple options for subsequent analysis, opening doors for collaboration, facilitating use of biobanks, identifying knowledge gaps, enhancing uniformity and quality of data collection, and improving patient engagement and care. This is the first phase of the Metadata Project and contains information only on the gathering and presentation of the metadata. Further analysis of the variables collected and their relevance to new research priorities is needed, as well as the augmentation of, and adherence to, guidance documents and standard practices for data collection and management.