Artificial intelligence is playing a larger role in genomics every day. Recently, a team of researchers from UC San Diego used AI to discover a DNA code that could pave the way for controlling gene activation. In addition, researchers from Australia’s national science organization, CSIRO, employed AI algorithms to analyze over one trillion genetic data points, advancing our understanding of the human genome through the localization of specific disease-causing genes.
The human genome, like all DNA, is built from four chemical bases: adenine, guanine, thymine, and cytosine, abbreviated as A, G, T, and C respectively. These four bases are joined together in various combinations that code for different genes. Around one-quarter of all human genes are associated with sequences that are roughly TATAAA, with slight variations. These TATAAA derivatives make up the “TATA box”, a non-coding DNA sequence that plays a role in initiating transcription of those genes. How the other roughly 75% of human genes are activated has remained unknown, however, because of the overwhelming number of possible base sequence combinations.
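To make the idea of “roughly TATAAA, with slight variations” concrete, here is a minimal sketch that scans a DNA string for six-base windows within one mismatch of the TATAAA consensus. The function name, the one-mismatch threshold, and the example sequence are all assumptions made for illustration; real promoter analysis uses far more sophisticated models than a simple mismatch count.

```python
def find_tata_like(sequence, consensus="TATAAA", max_mismatches=1):
    """Return (position, motif) pairs for windows close to the consensus.

    Hypothetical helper for illustration only: a window counts as a hit
    when it differs from the consensus at no more than max_mismatches bases.
    """
    sequence = sequence.upper()
    k = len(consensus)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, window))
    return hits


# Made-up example sequence, not real genomic data.
example = "GGCCTATAAAGGCGCGTATATAGC"
hits = find_tata_like(example)
for position, motif in hits:
    print(position, motif)
```

The scan reports both the exact consensus and a one-base variant, which is the kind of “slight variation” the TATA box tolerates.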
As reported by ScienceDaily, researchers from UCSD have used artificial intelligence to identify a DNA activation code that is employed as often as the TATA box. The researchers refer to the DNA activation code as the “downstream core promoter region” (DPR). According to the senior author of the paper detailing the findings, UCSD Biological Sciences professor James Kadonaga, the discovery of the DPR reveals how somewhere between one-quarter and one-third of our genes are activated.
Kadonaga initially discovered a gene activation sequence corresponding to portions of the DPR when working with fruit flies in 1996. Since then, Kadonaga and colleagues have been working to determine which DNA sequences are correlated with DPR activity. The research team began by creating half a million different DNA sequences and determining which of them displayed DPR activity. Around 200,000 DNA sequences were used to train an AI model that could predict whether or not DPR activity would be observed within stretches of human DNA. The model was reportedly highly accurate. Kadonaga described the model’s performance as “absurdly good” and its predictive power as “incredible”. The process used to create the model proved so reliable that the researchers ended up creating a similar AI focused on discovering new TATA box occurrences.
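The general workflow described here (label sequences as active or inactive, train a model on part of them, then test its predictions on sequences it has not seen) can be sketched in a few lines. The example below is a toy under loud assumptions: it uses a plain logistic regression rather than whatever model the UCSD team actually built, and its data are randomly generated with an arbitrary planted rule instead of real activity measurements.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Flatten a DNA string into a vector with four 0/1 entries per base."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def predict(weights, bias, features):
    """Logistic model: estimated probability that a sequence is 'active'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=50, lr=0.5):
    """Fit weights with plain stochastic gradient descent on log loss."""
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            bias -= lr * error
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights, bias


# Toy stand-in data, NOT real measurements: a sequence is labeled "active"
# only if it has a G at index 3 (an arbitrary rule planted for this demo).
rng = random.Random(0)
seqs = ["".join(rng.choice(BASES) for _ in range(8)) for _ in range(300)]
data = [(one_hot(s), 1.0 if s[3] == "G" else 0.0) for s in seqs]
train_set, test_set = data[:200], data[200:]

weights, bias = train(train_set)
correct = sum((predict(weights, bias, x) >= 0.5) == (y == 1.0) for x, y in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

Holding back 100 sequences for evaluation mirrors the basic idea of checking a model on data it was not trained on; the planted rule is deliberately simple so the toy model can recover it.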
In the future, artificial intelligence could be leveraged to analyze DNA sequence patterns and give researchers more insight into how gene activation happens in human cells. Kadonaga believes that, just as AI helped his team identify the DPR, it will also assist other scientists in discovering important DNA sequences and structures.
In another use of AI to explore the human genome, as MedicalExpress reports, researchers from Australia’s CSIRO national science agency have used an AI platform called VariantSpark to analyze over 1 trillion points of genomic data. It’s hoped that the AI-based research will help scientists determine the location of certain disease-related genes.
Traditional methods of analyzing genetic traits can take years to complete, but as CSIRO Bioinformatics leader Dr. Denis Bauer explained, AI has the potential to dramatically accelerate this process. VariantSpark is an AI platform that can analyze traits such as susceptibility to certain diseases and determine which genes may influence them. Bauer and other researchers used VariantSpark to analyze a synthetic dataset of around 100,000 individuals in just 15 hours. VariantSpark analyzed over ten million variants, amounting to one trillion genomic data points, a task that would take even the fastest competitors using traditional methods thousands of years to complete.
As Dr. David Hansen, CEO of CSIRO’s Australian e-Health Research Centre, explained via MedicalExpress:
“Despite recent technology breakthroughs with whole-genome sequencing studies, the molecular and genetic origins of complex diseases are still poorly understood which makes prediction, application of appropriate preventive measures and personalized treatment difficult.”
Bauer believes that VariantSpark can be scaled up to population-level datasets and help determine the role genes play in the development of cardiovascular disease and neuron diseases. Such work could lead to early intervention, personalized treatments, and better health outcomes generally.