How it Works
DNA is the basic unit used for storing biological data. A DNA molecule consists of two strands wound around each other, with bases attached to each strand. Each base on one strand bonds with a base on the other strand, and these bonds hold the two strands together. The bases are the most important part of the DNA, because the whole molecule can be identified by the order of the bases along the strands. There are four DNA bases: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). Bonds form only between A and T, and between C and G. Because of these base positions on the strands, a strand can be represented as a long sequence of the letters A, T, C, and G. These sequences, representing DNA strands, are therefore strings that a computer can process. A genome contains a set of DNA sequences, so processing a genome is equivalent to processing the corresponding set of strings, each a long combination of the four letters. In the case of RNA, the bases are Adenine (A), Uracil (U), Cytosine (C), and Guanine (G); in other words, Uracil (U) replaces Thymine (T) when it comes to RNA.
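Since the pairing rules determine one strand completely from the other, a strand really can be handled as an ordinary string. The following is a minimal sketch in Python; the function names are illustrative assumptions, but the pairing and the T-to-U substitution are exactly the rules stated above.

# Watson-Crick pairing as stated in the text: A binds T, C binds G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Return the opposite strand implied by the base-pairing rules."""
    return "".join(PAIR[base] for base in strand)

def to_rna(strand: str) -> str:
    """In RNA, Uracil (U) replaces Thymine (T)."""
    return strand.replace("T", "U")

if __name__ == "__main__":
    strand = "ATCGGCTA"
    print(complement_strand(strand))  # TAGCCGAT
    print(to_rna(strand))             # AUCGGCUA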
Sequencing is the process of mapping DNA or RNA to strings of letters. Next Generation Sequencing (NGS) is a recent biological development that performs this mapping at high speed. Because of its speed and throughput it is also called massively parallel sequencing, although it remains a biological process.
From this term, the terms Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) were derived.
In many cases, genomics data can be dissected into small parts that can be processed independently, without dependency problems. This makes it possible to apply parallel processing techniques. However, there are still issues that should be taken into consideration.
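As an illustration of this data-parallel pattern, here is a short Python sketch. The per-chunk task (counting bases) and the chunk size are assumptions chosen for clarity, not part of any specific genomics tool.

from collections import Counter
from multiprocessing import Pool

def count_bases(chunk: str) -> Counter:
    """Independent per-chunk work: count A, T, C, G occurrences."""
    return Counter(chunk)

def split(sequence: str, chunk_size: int):
    """Dissect a long sequence into independent pieces."""
    return [sequence[i:i + chunk_size]
            for i in range(0, len(sequence), chunk_size)]

if __name__ == "__main__":
    sequence = "ATCG" * 1_000_000
    with Pool() as pool:
        partials = pool.map(count_bases, split(sequence, 100_000))
    # Merging the partial results is the only step with a dependency.
    total = sum(partials, Counter())
    print(total)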
“All truths are easy to understand once they are discovered; the point is to discover them.”
– Galileo Galilei
Process & Results
Pipelining is a traditional parallel processing method. A series of processing elements is arranged one after the other, so that the output of each element becomes the input of the next. The data, separated into pieces, pass step by step from element to element, and each element executes a different processing task on each piece. As the pieces move along the pipeline, each one receives all the processing actions executed by all the elements, and the elements effectively work in parallel on different pieces.
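The dataflow can be sketched with Python generators, where each stage consumes the previous stage's output piece by piece. Note that this toy version runs the stages interleaved in a single process; in a real deployment each element would run on its own worker so that the stages genuinely overlap. The stage functions themselves are illustrative assumptions.

def read_chunks(sequences):
    """Stage 1: emit the data pieces one by one."""
    for seq in sequences:
        yield seq

def transcribe(chunks):
    """Stage 2: map DNA to RNA (T -> U) for each piece."""
    for chunk in chunks:
        yield chunk.replace("T", "U")

def measure(chunks):
    """Stage 3: compute a per-piece result."""
    for chunk in chunks:
        yield len(chunk), chunk

if __name__ == "__main__":
    data = ["ATCG", "GGTA", "TTAA"]
    # Each piece flows through all three elements in order.
    pipeline = measure(transcribe(read_chunks(data)))
    for length, rna in pipeline:
        print(length, rna)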
The problems arise from the requirement for highly skilled parallel processing technicians, who must take responsibility for low-level system programming and maintenance. They need deep knowledge of both the tools and the deployment environment. In particular, pipeline programming done through terminals rules out high-level abstraction tools that ordinary users could handle easily.
Another problem is job submission. This is a more general problem, not specific to genomics data. When a job is submitted for processing on a cluster, it is first placed in a queue to await execution. Moreover, a predefined number of nodes, together with their cores, must be allocated explicitly for a specific processing purpose through low-level scripting. Increasing the number of cores increases the time the submitted job waits in the queue, while decreasing it increases the job's execution time. Optimization is clearly necessary here: the total time, the sum of the queue time and the execution time, must be minimized.
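Written out, the goal is to choose the number of cores c minimizing T_total(c) = T_queue(c) + T_exec(c). The sketch below searches for that minimum under loudly hypothetical cost models: queue wait is assumed to grow with the requested cores and execution time to shrink with them. A real scheduler would need measured data in place of these model functions.

def queue_time(cores: int) -> float:
    """Assumed model: waiting time grows linearly with requested cores."""
    return 2.0 * cores

def execution_time(cores: int, work: float = 3600.0) -> float:
    """Assumed model: a fixed amount of work divides across cores."""
    return work / cores

def best_core_count(max_cores: int = 128) -> int:
    """Minimize total time = queue time + execution time."""
    return min(range(1, max_cores + 1),
               key=lambda c: queue_time(c) + execution_time(c))

if __name__ == "__main__":
    c = best_core_count()
    print(c, queue_time(c) + execution_time(c))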
While genomics data can be dissected into parts without interdependencies, the same is not true of tasks. Inter-task dependencies exist, since the output of one processing element becomes the input of the next. But these interdependencies are embedded in the pipeline code itself, which makes the pipeline difficult to maintain and develop, as the toy example below shows.
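The stage names below are placeholders; the only thing the example demonstrates is that the dependency "deduplicate consumes align's output" exists nowhere except in the call order, so inserting, removing, or reordering a stage means editing the pipeline code itself.

def align(reads):
    """Placeholder stage: stands in for a real alignment step."""
    return sorted(reads)

def deduplicate(aligned):
    """Placeholder stage: drop repeated reads, keeping order."""
    return list(dict.fromkeys(aligned))

def run_pipeline(reads):
    aligned = align(reads)         # deduplicate depends on align...
    unique = deduplicate(aligned)  # ...but that fact is only implicit here
    return unique

if __name__ == "__main__":
    print(run_pipeline(["GATT", "ACAG", "GATT"]))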