Encoding Data into DNA

4 times as many values per group at worst, with the potential for much more!

After watching an item on the BBC news channel here in the UK about how technology is now allowing us to encode data into pieces of DNA, I thought I would post this small explainer to make the idea easier to understand.

There is a lot of complicated science and maths involved here, which hopefully I will be able to break down for you. 

So, there are six (6) core parts which make up DNA: Deoxyribose (a sugar with 5 carbon atoms!), Phosphate, and four (4) nitrogenous bases (Adenine, Thymine, Cytosine, and Guanine, also known by their first letters as A, T, C, and G).

Because of the way DNA is made up, the most basic part of DNA is a nucleotide. This is made up of one sugar molecule, one phosphate molecule, and one of the four nitrogenous bases. 

This gives us four (4) possible combinations at this stage.

DPA Deoxyribose, Phosphate, and Adenine
DPT Deoxyribose, Phosphate, and Thymine
DPC Deoxyribose, Phosphate, and Cytosine
DPG Deoxyribose, Phosphate, and Guanine

If we just took these four (4) combinations and used them to encode binary, we would be able to encode two (2) binary bits per combination, meaning each nucleotide would carry twice as much as a single binary digit.

For example:

DPA 00
DPT 01
DPC 10
DPG 11
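
To make that table concrete, here is a minimal sketch in Python of turning a string of binary into bases and back again. The function names are my own for illustration, and I have shortened DPA, DPT, DPC, and DPG to their base letters, since the sugar and phosphate are always there.

```python
# Two-bit mapping copied from the table above.
# DPA/DPT/DPC/DPG are shortened to A/T/C/G (an assumption made for brevity).
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_bases(bits: str) -> str:
    """Turn a string of 0s and 1s into bases, two bits at a time."""
    if len(bits) % 2 != 0:
        raise ValueError("need an even number of bits")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(bases: str) -> str:
    """Turn a string of bases back into 0s and 1s."""
    return "".join(BASE_TO_BITS[base] for base in bases)


print(bits_to_bases("00011011"))  # ATCG
print(bases_to_bits("ATCG"))      # 00011011
```

Eight binary digits go in and only four letters come out, which is the "two bits per nucleotide" idea in action.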

However, this would ignore the most effective way to store the data in a string of DNA, as these nucleotides then also join to form a base pair.

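A single base pair joins two nucleotides together, one from each strand. Adenine always pairs with Thymine, and Cytosine always pairs with Guanine.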

However, the body reads DNA in threes (each group of three bases is called a codon), down each strand of the pair. This is like having letters written on two pieces of paper, down the longest side, placing them next to each other, joining each letter with its partner letter on the other sheet of paper with sellotape, and then counting down each page, ignoring those joins.

The body does this, and it provides us with sixty-four (64) possible combinations. However, in this little explainer, I propose that in order to make this process more effective, and more logical when it comes to data storage, we instead count four (4) bases down a single strand as a single group. Counting four (4) base pairs as a single group would work just as well, since each base’s partner on the other strand is already decided for it.

This will allow for two-hundred and fifty-six (256) possible combinations (4 x 4 x 4 x 4) of the nucleotides when grouped like this. This ignores the Deoxyribose (sugar) and the phosphate, which are always present.

For example, the sixty-four (64) possible combinations using Adenine as the first base are:

AAAA ATAA ACAA AGAA
AAAT ATAT ACAT AGAT
AAAC ATAC ACAC AGAC
AAAG ATAG ACAG AGAG
AATA ATTA ACTA AGTA
AATT ATTT ACTT AGTT
AATC ATTC ACTC AGTC
AATG ATTG ACTG AGTG
AACA ATCA ACCA AGCA
AACT ATCT ACCT AGCT
AACC ATCC ACCC AGCC
AACG ATCG ACCG AGCG
AAGA ATGA ACGA AGGA
AAGT ATGT ACGT AGGT
AAGC ATGC ACGC AGGC
AAGG ATGG ACGG AGGG
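
If you would rather not write these out by hand, the table above can be generated in a few lines. This is only a sketch: the base order A, T, C, G is taken from the table, and the variable names are my own.

```python
from itertools import product

BASES = "ATCG"  # same order as the rows and columns of the table above

# Each row fixes the last two bases; each column changes the second base.
# The first base is always A, as in the table.
rows = []
for third, fourth in product(BASES, repeat=2):
    rows.append(["A" + second + third + fourth for second in BASES])

for row in rows:
    print(" ".join(row))

# Dropping the "first base is A" restriction gives every four-base group.
all_groups = ["".join(group) for group in product(BASES, repeat=4)]
print(len(all_groups))  # 256
```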

This obviously also allows for sixty-four (64) possible combinations for each of the other 3 remaining bases, meaning we can encode two-hundred and fifty-six (256) possible combinations into just four (4) bases. That is exactly the number of values that eight (8) binary digits can hold, so each base does the work of two of the 0’s and 1’s from binary code.

We could code each of those sixty-four (64) combinations above with a unique eight (8) digit string of binary.

As a result, AAAA might equal 0000 0000 and GGGG might equal 1111 1111, with all two-hundred and fifty-four (254) remaining combinations in between. AGGG in this example would equal 0011 1111.
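
Here is a rough sketch of how one byte could be turned into a group of four bases and back, assuming the ordering used in the examples above (A = 00, T = 01, C = 10, G = 11). The function names are made up for this post, and this is not how any real DNA storage system is known to do it.

```python
BASES = "ATCG"  # position in this string is the two-bit value: A=0, T=1, C=2, G=3


def byte_to_group(value: int) -> str:
    """Turn one byte (0 to 255) into a group of four bases."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    # Take two bits at a time, starting from the left-hand end of the byte.
    return "".join(BASES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def group_to_byte(group: str) -> int:
    """Turn a group of four bases back into one byte."""
    if len(group) != 4:
        raise ValueError("a group must be exactly four bases")
    value = 0
    for base in group:
        value = (value << 2) | BASES.index(base)
    return value


def encode(data: bytes) -> str:
    """Encode a whole run of bytes, four bases per byte."""
    return "".join(byte_to_group(byte) for byte in data)


def decode(dna: str) -> bytes:
    """Decode a string of bases (a multiple of four long) back into bytes."""
    return bytes(group_to_byte(dna[i:i + 4]) for i in range(0, len(dna), 4))


print(byte_to_group(0b00000000))  # AAAA
print(byte_to_group(0b00111111))  # AGGG
print(byte_to_group(0b11111111))  # GGGG
print(decode(encode(b"Hi")))      # b'Hi'
```

Because every byte maps to exactly one group and every group maps back to exactly one byte, nothing is lost going in either direction.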

With four (4) groups of those sixty-four (64) combinations, we have two-hundred and fifty-six (256) possible combinations, which matches the two-hundred and fifty-six (256) possible values in 8-bit binary. A single group of four (4) bases can therefore have two-hundred and fifty-six (256) possible results, instead of the sixty-four (64) we get from counting in threes, meaning each group can take quadruple the number of values at the cost of one extra base. There is the potential for there to be so much more than this.

There is no reason to stop at the number of four (4) bases per grouping. Eight (8) bases per grouping would give us sixty-five thousand, five-hundred and thirty-six (65,536) possible combinations, the same as sixteen (16) binary digits. Larger groupings mean far more combinations to work through, which may make the processing time required to return the file back into a piece of binary code too long to be practical at this time, but it is possible. Each base still carries two (2) bits however the groups are drawn, so what grows with the group size is how much a single group can represent, not how tightly the DNA itself is packed.
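
As a quick check on the arithmetic, the number of combinations for any group size is just 4 multiplied by itself once per base. The small sketch below (the function names are my own) prints a few group sizes side by side. By this count, eight (8) bases come to 65,536 combinations, and it takes twelve (12) bases to reach 16,777,216.

```python
def combinations(bases_per_group: int) -> int:
    """Number of different groups: four choices at each position."""
    return 4 ** bases_per_group


def bits_per_group(bases_per_group: int) -> int:
    """Binary digits needed to hold the same number of values."""
    return 2 * bases_per_group


for size in (1, 3, 4, 8, 12):
    print(size, combinations(size), bits_per_group(size))
```

However the bases are grouped, the number of bits works out to twice the number of bases.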

A system like this also needs to take into account the speed with which the data can be read from the storage device, and how easily it can be altered, updated, or created (especially if these are read-only or write-once style devices, which the currently available technology appears to be). Building processors into the devices used to write and read the data is something which would need serious consideration, in order to offload at least part of this process from the main processor of the device receiving the data.

This also doesn’t take into account error checking, or bit shifting, or anything like that.

There are a lot of other things which a technology in its infancy such as this will need to take into consideration. There are a lot of avenues which could be pursued in order to bring this technology into everyday lives, including around encryption and the wider long-term data storage security environment.

It will likely become possible for a whole rack’s worth of storage on hard drives to fit into something the size of a thumb drive when DNA storage becomes more mainstream. This could make data storage virtually unlimited over the coming decades, if the storage can be updated as easily as it can currently be read.