How Machine Learning Helps Us Understand the Human Genome


Potential applications surrounding the study of genomes are changing rapidly as innovation and scientific discoveries push the field into new territory. Dr. Keriayn Smith, Assistant Professor in the Department of Genetics at the University of North Carolina at Chapel Hill, works to understand segments of DNA. A 2022 graduate of Fuqua’s MSQM: Business Analytics program, Smith uses machine learning to detect patterns that explain genes’ functions, expression and interactions.

In this conversation with Professor Jeremy Petranka, Smith explains how machine learning allows researchers to meaningfully isolate genomes and what could be next in the rapidly evolving industry around this technology.


The following excerpts from that interview were edited for length and clarity.

JEREMY: Could you give a high-level view of this type of research and the types of innovations that were occurring before about 2010? In the '80s and '90s, and even the early 2000s, what did it look like?


KERIAYN: That really is a fantastic foundational question, and you're right, you know, it kind of takes us back to more than 10 years ago, maybe 15 or 20. I think what really changed was the development of technologies (what we call deep sequencing technologies) that allowed us to understand the composition of the genome, or the entire DNA sequence, of a human or animal or plant, any organism really. At a basic level, this DNA sequence is made up of only four components, or nucleotides, abbreviated as the four letters AGTC. And by the way, for a human, that sequence would stretch from here to the moon something like a hundred thousand times if you unwound it all and connected it end to end.


It's just that we're dealing with billions of these nucleotide letters, plus different arrangements of these letters, and different types of interactions with a plethora of other cell components. These complex combinations are what drive the vast complexity of a human being, and yet we now have the capabilities and data to study this complexity at a granular level. So, to go back to your question, prior to this, the genomics field, as we call it, took more of a single-gene investigative approach, or the work was based on more specific individual hypotheses. So, that's really kind of what changed in the past decade or decades.


J: Using the moon example, was it the case that before, if you were thinking about all these As, Gs, Ts and Cs just spread out to the moon and back, over and over and over, you were basically taking like a mile chunk of that?


K: Exactly, or even, let's use a mile, for example. You could sequence short segments, and you were able to use that information to study that particular segment. Let's say a mile is equal to a gene. So, you'd be able to study that particular gene, deplete it or manipulate it somehow. But right now, you can not only sequence many genes or identify many genes, you can then use the output of these genomics types of analysis to monitor outcomes based on the changes that happen in cells and whatnot.


J: So, you really kind of had to know where to look and that feels like it's changed from what you implied in the last 10 to 15 years. So, what did change?


K: The technologies to give us the information at the level of billions of nucleotides, and then tools such as CRISPR, for example, that allow us not only to query the function but also to manipulate precise sequences in the DNA. To go into a little bit more about just CRISPR: this is such an amazing technology, and what's quite remarkable, for me at least, is how simple it is and how well it works. This is the best part. What it does is allow us to make very, very precise cuts in DNA. For example, it can cut between an AG and a TC, and what this can do is then turn genes off or on, or change them. This works with varying levels of success in human cell types, various animal cell types, plant cell types, and I could go on. So what changed is the capacity to read out the sequence of the genome, as well as the development of many tools to manipulate it. We’re kind of developing these things at a rapid pace.


J: It feels like, if I'm understanding you correctly, it's kind of two technologies that have changed the game. One is to be able to, not just get a mile of that sequence to the moon and back, but to effectively get the entire map of what that looks like, and two, the ability now to precisely say, “well, now that we know everything, if we just want to change that three feet right there, we can,” with CRISPR?


K: Exactly. That's powerful and has so many implications and uses already. You know, Jeremy, that's kind of even looking at it in a very simplified way, because those are two enabling technologies, but there are so many others. Because we have access to the data, we can now tweak so many approaches that have developed concomitantly as well.


J: And by the way, for those who have never heard the term CRISPR before, it stands for Clustered Regularly Interspaced Short Palindromic Repeats.


K: Yes, and you know where CRISPR comes from? It comes from a kind of bacterial immune system. So, it stems from how bacteria fend off viruses, which in itself is kind of really cool, and then they maintain that memory. If the same kind of virus attacks the bacteria later on, they can quickly respond because they've been exposed before. So, that's kind of where the technology itself comes from and we've adapted that to manipulate it in just so many cell types.


J: CRISPR on its own can do one thing, but CRISPR with the data opens up the world. The convergence of these technologies in the genetics world is very similar to the convergence of technologies in the data science world, where some of the algorithms have been studied for decades, but it wasn't until you had this horsepower of cloud computing and everything else that you really saw them everywhere. And now those two together, it feels like, allow some of the innovations you're seeing. Could you talk about how the data is used, potentially which data science methods are involved, but in general, how you're using it?


So, to give you an idea, and I'm sure you're well aware of all of these, I did a quick search and found over 50 software tools for genome editing. Most of them were open source and I have no idea what they did, but the fact that they're just available now makes it feel like there's so much going on that I'm unaware of.


K: Oh yeah. It's quite amazing. I guess let's think of it this way—we as humans are made up of many different cell types with very different functions. A beating cell in the heart does not do the same thing as a neuron in the brain, yet all of our cells have the same genome or DNA sequence, excluding mutations that arise for different reasons. So, how can we effectively use tech such as CRISPR to know where to cut? Given the vastness of the genome and the billions of letters, this is like a data problem—the sheer numbers of AGTCs, the computations, the interactions, the permutations of all these things—and fortunately, in part because of the way that science is funded, there is a plethora of tools. As you mentioned, many are publicly available and quite straightforward to use. Basically, anyone with appropriate training can learn how to edit these genes and access these tools.


Data science, bioinformatics, and computational knowledge, in general, are critical to answering the big research questions where we are today. Because not only do you have to kind of manipulate or understand the sequence composition, you have to be able to look at multifactorial responses across a sample of individuals, for example.


So, these kinds of computational tools and high-powered computing are critical, and they moved us away from where we started, the single-gene kind of approaches. Now that the costs of sequencing and computing power are much lower, it's just become more accessible.
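To make the "where to cut" question concrete, here is a minimal sketch of the kind of search such tools perform. It is an illustration only and is not taken from the interview: it assumes the commonly used SpCas9 enzyme, which targets a roughly 20-letter stretch of DNA that sits immediately before a three-letter "NGG" motif (any letter followed by two Gs). The function names are hypothetical, and real design tools also score candidates and check for unintended matches elsewhere in the genome.

```python
import re

def reverse_complement(seq: str) -> str:
    """Return the opposite DNA strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_target_sites(seq: str, guide_len: int = 20):
    """List (start, target, motif) for every guide_len-letter stretch
    followed directly by an NGG motif on the strand given.

    Assumption: SpCas9-style targeting; other enzymes use other motifs.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})([ACGT]GG))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

def find_sites_both_strands(seq: str, guide_len: int = 20):
    """Search the given strand and its reverse complement."""
    return {
        "forward": find_target_sites(seq, guide_len),
        "reverse": find_target_sites(reverse_complement(seq), guide_len),
    }

if __name__ == "__main__":
    # Toy sequence with a shortened 4-letter target so the result is easy to read.
    for strand, sites in find_sites_both_strands("ACGTAGGTT", guide_len=4).items():
        print(strand, sites)
```

Even this toy version shows why the problem is computational: the same scan has to run across billions of letters, and every candidate then has to be compared against the rest of the genome.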


J: So, if I'm understanding you correctly, you have all these AGTCs just, I'm going to say randomly (it's not random, but it looks random from the outside), over and over and over. You're actually using the data science, analytics and bioinformatics tools to start finding patterns, and not just patterns, but patterns within patterns.


K: Exactly. I can use an example from my research to illustrate this point. The genome sequence, or the order of letters, kind of looks random to an outside eye. But the thought was that, as humans, we'd be much more complex than, say, a house mouse. That's not how it actually turned out. We have a similar number of genes to a mouse, and only 3% or so of the genome accounts for proteins, which was a surprise.


The proteins are kind of easier to study because there are some consistencies across species that make it easy to deduce the function of a new protein, for example. The rest of the genome, the 97%, is mysterious and that's the area I focus on. I use computational approaches, such as machine learning, that draw on sequence patterns in one species, as well as on how proteins interact with certain segments of DNA. Then I use that to kind of predict the [role that particular molecule plays within the cell] in the other 97% of the genome.


So, for example, it could somehow help to make sure that a beating heart cell continues to beat, it might help to optimize that process. And without it, you would have some beating, but you wouldn't have as optimal of a beat as with that particular cardiomyocyte.
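As a rough illustration of what "sequence patterns" can mean to a computer, the sketch below turns a DNA string into numbers by counting short overlapping "words" (k-mers) and then compares two sequences by how similar those counts are. This is a generic, textbook-style feature and is an assumption made for illustration; the interview does not describe the specific features or models Smith uses, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_vector(seq: str, k: int = 3) -> list[float]:
    """Return the frequency of every possible k-letter word over A, C, G, T.

    Words containing any other character (for example N) are ignored.
    """
    seq = seq.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two frequency vectors: 1.0 means the same pattern mix."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    known = kmer_vector("ATGCGATATATAGCGC")    # toy "known" segment
    unknown = kmer_vector("ATGCGATATATTGCGC")  # toy "unstudied" segment
    print(round(cosine_similarity(known, unknown), 3))
```

In practice, vectors like these would be one input among many to a trained model, alongside experimental data such as which proteins bind which segments.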


J: If you're open to it, I'd love to now talk about moving into industry and bringing this into innovation there. Starting with agriculture, can you talk a little bit about how we're seeing these new tools and abilities, and how they're transforming agriculture, potentially at a global level?


K: Yeah, sure. Just to touch on a few applications, CRISPR can be used (and you can just stick with CRISPR as an example of a gene-editing tool) to manipulate plant genes in the same way. Some possible outcomes that people, companies, or researchers would target are more robust or sturdy plants that are resistant to adverse conditions or plant diseases. But not only that, when you think about pest organisms, using insects as an example, you can edit genes within them to reduce their ability to transmit diseases to plants or to make them reproduce less efficiently, as quick examples. In terms of just where we are and how realistic this kind of work is, I'd say the limitations here are really based on how much data we have.


For example, we have the entire genome sequence of a human, and the mouse genome is fairly complete. We’ve been studying humans and animals for a while. So, the limitation here affecting the reality of what we're able to do in an agricultural context is how much data are already out there. Some of these genomes, especially plants, are even larger than human genomes and it can be more difficult to get accurate data. But once we have that to build upon, we can do the same things as I mentioned before, depending on the species of the plant and how easy it is to introduce some of the components of CRISPR.


J: And where are we seeing a lot of that innovation happening? Is it at the kind of traditional, large agricultural companies that might be household names, or are we seeing a startup community emerge, especially in a world where, once the data is out there, it sounds like anyone with access to cloud computing can start doing some work?


K: That's what makes it awesome. It's so accessible and it really is driving just so much innovation. You’d expect large companies who have been invested in agriculture for a while to be pushing in these directions—and this is just such an active space. But also, startups and smaller entities are joining them to push innovation.


In general, I think their targets are quite broad ranging, and they are still different enough that there's room in this space. For example, this could be either developing or improving crops that withstand various climate conditions or require fewer resources and can be grown more easily. Then on top of that, the next level could involve gene tweaks to make a more nutritious product or to improve cooking characteristics. I think, you know, across the spectrum, various entities are pushing innovation and it's a very exciting time to be in this space.


J: One thing that I think a lot of people think about when they hear, you know, modifying genomes, is the thought of ethics. At a high level, could you give us an idea of what decision-making bodies exist within this realm, and who has a seat at the table?


K: So, we hear about designer babies. CRISPR has been used to change genes in human embryos that resulted in live births, as we've heard in the media. Then, there are scientists who are generating sterile mosquitoes to combat diseases, and who knows what long-term environmental impact and biological effects that has on the species and adjacent species.


Policy frameworks exist and there is guidance from regulatory bodies such as the FDA, for example, as it pertains to healthcare. But I wonder, and I guess this is my personal opinion, if regulatory bodies are possibly lagging behind. This is simply due to just the pace at which the science is evolving. We're really progressing at lightning speed and not only with gene editing, but with reproductive science, cloning, gene therapy, etc. So, you know, while there are laws that exist, I don't know that those laws cover the breadth of organisms that we're able to edit using CRISPR.


So, in addition to the FDA and probably agricultural regulatory bodies, I don't know the kind of swath of regulations that we need covering ecological impacts and natural resource management, as well as healthcare issues. And this is just my kind of personal view—we really need scientists, policymakers, and people with varying skill sets at the table working to keep ahead of this. I can't really say who is at the table currently, but that's kind of my personal view on where we are and some considerations.