Introduction to the Rare Disease Data Center
The Rare Disease Data Center (RDDC), spearheaded by the Artificial Intelligence Innovation Center of the Tsinghua Pearl River Delta Research Institute and supported by Cyagen's biotechnical and genetic expertise, was conceived in February 2021 and proudly launched its inaugural version (RDDC 1.0) in February 2022.
The RDDC has been acclaimed within the scientific and medical community over the first year of its operation, prompting an influx of insightful feedback from professionals. Leveraging these valuable suggestions, we have enacted a series of iterative upgrades, culminating in the release of RDDC 2.0 on July 1, 2023. This enhanced version offers optimized data presentation and interaction capabilities, honing our ability to cater to the sophisticated requirements of data querying and mining in scientific research. As its cornerstone, RDDC persists in its concentration on gene and genetic-related data, committed to exploiting the vast potential of genetic big data for the advancement of bio-artificial intelligence tools.
China currently lacks a public database dedicated to rare diseases, and existing international disease databases do not adequately visualize rare disease data. RDDC steps forward to bridge this critical gap. We strive to give healthcare practitioners, academic researchers, and families affected by rare diseases a swift, intuitive understanding of these conditions by using extensive data visualization without compromising the integrity of the original data. An advanced tagging and categorization system, presently under development, will speed up the filtering of targeted information. Furthermore, RDDC serves as a nexus for domestic rare disease resources, building the robust data foundation that rare disease research in China needs in order to progress.
On the Genes page, the RDDC has collated gene information for humans, mice, and rats, with other species planned. Users can access the following information:
Basic gene information (ID, alias) and function description
Comparison of orthologous genes in humans, mice, and rats
Locus information of genes
Display of gene-related mutations
Functional domain of corresponding proteins
Gene transcript information
Information on gene-related diseases (in humans)
Information on gene-related phenotypes (in humans)
Gene expression information
Subcellular localization of genes (in humans)
Protein interaction maps
On the Diseases page, the RDDC has gathered information from Malacards, OMIM, Orphanet, ClinVar, and other open-source databases, along with local disease data provided by the Rare Diseases Alliance. Users can access the following information:
Description of basic disease information in different databases
Disease ID and alias
Disease epidemiology information (being updated)
Description of the disease's Human Phenotype Ontology (HPO)
List of disease-related genes and distribution of gene mutations
Progress in the development of disease-related drugs
Information on disease-related clinical trials
On the Mouse-Model page, the RDDC has collected various types of gene-edited mouse models used in numerous studies. Users can access the following information:
Basic information about the mouse model
The methodology used to create the mouse model
Background information of hybrid mice involved in gene editing
Phenotype information of hybrid mice involved in gene editing
Publications related to the mouse model
In addition to data cleaning and visual display, the RDDC also uses its structured data to train AI models. AI has broad applications across all facets of biomedical research (Figure 1). The RDDC is committed to developing AI models for application scenarios ranging from the discovery of rare disease targets to the marketing of rare disease drugs. The current focus of the RDDC remains target discovery: the RNA Splicing Prediction Tool version 1.0 and the Mutation Pathogenicity Prediction Tool version 1.0 have already been completed.
The RDDC aims to develop AI tools that cover the entire drug development process and continuously explore its complex problems. These problems range from research into biological mechanisms, potential drug targets, intricate drug reactions, and gene therapy drug optimization to the translation of drug results from animal models to humans and drug effects at the population level.
Figure 1. RDDC aims to apply AI technology throughout the entire process of drug development for rare diseases. Panels: Biomedical Data Modalities; Machine Learning Models; Challenges & Opportunities.
Tools that have been launched include:
RNA Splicer: This tool predicts whether a base mutation changes mRNA splicing sites, and it analyzes and displays the prediction results in detail.
Patho Predict: Using the XGBoost machine learning method, this tool predicts the degree of disease effect caused by a base mutation. The prediction results are divided into four pathogenicity categories.
ASO Predict: This tool predicts the best ASO candidate sequence by calculating the binding energy between the ASO and the base sequence of the target region, as well as other base pairing indicators.
SNP Visualization tool: Users can view the mutation distribution and mutation status of the input gene, making it easier to query mutation hot spots and sites.
Pathway Analysis: This online pathway enrichment tool visually displays the changes in gene expression within a pathway after enrichment.
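To make the idea of pathway enrichment concrete, the following is a minimal sketch of over-representation analysis using a one-sided hypergeometric test. This is one common approach and is shown here only as an assumption: the text above does not state which statistical method the Pathway Analysis tool uses. The function names, gene symbols, and pathway names are hypothetical and are not part of any RDDC interface.

```python
from math import comb


def enrichment_p_value(background_size, pathway_size, hits_size, overlap):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `overlap` pathway genes among `hits_size` genes drawn at random from a
    background of `background_size` genes, `pathway_size` of which belong
    to the pathway."""
    upper = min(pathway_size, hits_size)
    total = comb(background_size, hits_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, hits_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_pathways(hit_genes, pathways, background_genes):
    """Return (pathway name, overlap, p-value) tuples, most enriched first."""
    background = set(background_genes)
    hits = set(hit_genes) & background
    ranked = []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(hits & members)
        p = enrichment_p_value(len(background), len(members), len(hits), overlap)
        ranked.append((name, overlap, p))
    return sorted(ranked, key=lambda row: row[2])


if __name__ == "__main__":
    # Hypothetical toy data: 20 background genes and two made-up pathways.
    background = [f"GENE{i}" for i in range(20)]
    pathways = {
        "pathway_A": ["GENE0", "GENE1", "GENE2", "GENE3"],
        "pathway_B": ["GENE10", "GENE11", "GENE12", "GENE13"],
    }
    hits = ["GENE0", "GENE1", "GENE2", "GENE15"]
    for name, overlap, p in rank_pathways(hits, pathways, background):
        print(f"{name}: overlap={overlap}, p={p:.4g}")
```

In practice, p-values computed across many pathways would also need a multiple-testing correction, which this sketch leaves out.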
Rare Diseases - An Overview
The definition of rare diseases varies from country to country, but the term generally refers to diseases that affect fewer than 1 in 2,000 people. Because cases of any single rare disease are scarce, research on rare diseases lags far behind research on common diseases, and the classification of many rare diseases remains unclear. According to the U.S. Food and Drug Administration (FDA), more than 7,000 rare diseases are known worldwide, and more than 14,000 rare diseases are annotated in the Malacards database. There are more than 350 million rare disease patients worldwide, of whom nearly 50% are children. Fewer than 10% of these rare diseases have approved treatments or regimens. It is also worth noting that more than 80% of rare diseases are genetic or gene-related, and a considerable share of them are monogenic.
Figure 2. Rare Disease Ratio. Panels: "Rare diseases make up over two-thirds of all diseases worldwide"; "The proportion of rare diseases in Malacards disease classification".
In China, owing to the large population, nearly 20 million individuals are affected by rare diseases. In 2018, to boost rare disease diagnosis and treatment, China's National Health Commission, the Ministry of Science and Technology, the Ministry of Industry and Information Technology, the Food and Drug Administration, and the Administration of Traditional Chinese Medicine issued the "First Batch of Rare Disease Catalogues." This list includes a total of 121 rare diseases, such as hereditary nephritis (Alport syndrome), amyotrophic lateral sclerosis (ALS), and hemophilia. However, because most rare diseases lack effective animal models and have small patient populations, investment in rare disease research and development is relatively low.
In the realm of rare disease information integration, in 2020 China launched a rare disease registration system led by Peking Union Medical College Hospital in collaboration with several renowned hospitals nationwide. Internationally, the U.S. National Institutes of Health (NIH) and the National Organization for Rare Disorders (NORD) have made significant contributions, and many rare disease non-profit organizations have also taken proactive measures. Currently, rare disease information focuses primarily on clinical trials and basic disease data, while crucial preclinical disease model information is not well structured. Databases such as MGI, which centers on mouse models, organize mouse phenotype data but are not closely integrated with clinical information. Orphanet, a database built primarily on structured rare disease epidemiology and standardized naming information, still stores data in packets, which limits efficient display.
Rare disease research today remains "island-style," with resources isolated from one another, even as interest in the field grows with the development of gene editing technology. Modern rare disease databases therefore require not only high-quality data but also better ways of displaying their content. AI is playing an increasingly significant role in life sciences, in areas such as small molecule drug discovery, macromolecular structure prediction, and pathological image recognition, and using AI to predict rare diseases caused by genetic mutations and to design therapeutic viral vectors will likely become a major breakthrough in the broader health field. The RDDC was established to provide a one-stop, interactive visualization platform linking disease, gene, animal model, and AI tool for patients, physicians, and researchers, ultimately helping everyone committed to rare disease treatment or research as much as possible.