The study of genomics has produced numerous insights into biology. Sequencing organisms across the phylogenetic tree has illustrated the high level of relatedness among species. Using the degree of difference between species' genomes, we can estimate how far they have diverged in their evolution. This has allowed us to further refine the phylogenetic tree. The high degree of homology among functional regions of the genome has allowed us to transfer lessons learnt about proteins in one species to those found in another. Studying the sequence similarities between coding regions has enabled us to predict novel genes, which can later be confirmed by experimental studies.
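As a rough illustration of how sequence differences translate into an evolutionary distance, the sketch below computes the proportion of differing sites (the p-distance) between two aligned sequences and applies the Jukes-Cantor correction for multiple substitutions at the same site. The two sequences are invented toy data, the function names are hypothetical, and Jukes-Cantor is only one of several substitution models; real analyses work from proper alignments of much longer sequences.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of comparable aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences carry an unambiguous base,
    # which skips gaps ('-') and ambiguity codes such as 'N'.
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (invented for illustration only).
seq_x = "ACGTACGTACGTACGTACGT"
seq_y = "ACGTACGAACGTACTTACGT"

p = p_distance(seq_x, seq_y)
d = jukes_cantor_distance(p)
print(f"p-distance = {p:.3f}, Jukes-Cantor distance = {d:.3f}")
```

The corrected distance is always at least as large as the raw p-distance, because it accounts for substitutions that have been overwritten by later changes at the same site.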
In humans, roughly 20 000 protein-coding genes have been identified. Mutations in these genes are responsible for some genetic disorders. With thousands of human genomes sequenced, genome-wide association studies are able to predict phenotypes based on multiple genetic variants. Sequencing the genomes of cancers enables their specific characterisation, thereby determining their molecular cause and perhaps guiding the application of specific treatment.
It is now possible to sequence an entire human genome in a day. This produces very large quantities of data, which require substantial storage space and computing power for analysis. The whole human genome is not always necessary; sequencing only the exome is often sufficient. If a specific gene has been identified as problematic, only the sequence of this gene may be necessary. Recently there has been a strong drive to sequence a greater proportion of the African population, to increase the diversity of sequences within genomics repositories. This initiative is spearheaded by H3Africa.
Genomics data generation can be performed in a variety of ways. If a small, specific area of the genome, such as a gene or control sequence, needs to be studied, this area can be amplified by PCR, quantitated and characterised. Recently this has been shown to be effective in identifying SARS-CoV-2 infection during the COVID-19 pandemic. If more global information about the genome is required, commercial or custom DNA microarrays are available to determine genomic variation. Sanger sequencing, which was originally used to sequence the human genome, has been optimised to allow high-throughput applications. Next-generation sequencing now allows rapid sequencing of an entire genome within a day. The whole genome can be sequenced, or just the exome, depending on the level of detail required.
When designing an experiment, it is important to first identify the level of information required and then select the appropriate technology. Low-coverage techniques are low cost and allow many samples to be measured. High-resolution, full-coverage approaches provide more information but are often too costly to apply to many samples. Because a sufficient number of replicates is required to draw meaningful conclusions, choosing the right technology for an experiment is vital. The type of data analysis will differ considerably depending on the data produced. Simple techniques like PCR produce simple results that need only simple analysis. More complex techniques produce more complex results, which require substantial data storage space and computational power to analyse. It is best to familiarise yourself with the data analysis steps and requirements before generating data.
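To make the trade-off between coverage and sample number more concrete, the sketch below uses the standard approximation that expected coverage equals the number of reads multiplied by the read length, divided by the size of the target region. The genome size (about 3.1 billion bases) and the 150-base read length are assumed round figures for illustration, and the function names are hypothetical; actual yields depend on the platform and library preparation.

```python
import math


def expected_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Average number of times each base of the target is expected to be read."""
    return num_reads * read_length / target_size


def reads_needed(target_coverage: float, read_length: int, target_size: int) -> int:
    """Number of reads required to reach a desired average coverage."""
    return math.ceil(target_coverage * target_size / read_length)


# Assumed values for illustration only.
GENOME_SIZE = 3_100_000_000  # approximate size of a haploid human genome, in bases
READ_LENGTH = 150            # assumed short-read length, in bases

for depth in (1, 30):
    n = reads_needed(depth, READ_LENGTH, GENOME_SIZE)
    print(f"{depth}x coverage needs about {n:,} reads of {READ_LENGTH} bases")
```

Because the number of reads scales linearly with both the coverage and the size of the target, restricting sequencing to a smaller region or accepting a lower depth frees up capacity for more samples or replicates.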
It has become a requirement of publication to make the data generated for a publication publicly available. There are a number of repositories for data deposition. These data are processed, compiled and made accessible via genome browsers. Omics experiments often rely heavily on available data as a basis for data processing. Sometimes it may even be possible to make use of existing data rather than generating new data. Numerous tools have been developed to analyse the different types of genomics data. Usually more than one tool is required to complete an analysis. Tools are therefore combined into pipelines or workflows, in which they are arranged so that the output of one tool is already compatible with the next. Through this platform we provide assistance in choosing and using the appropriate tools and pipelines for a particular analysis. We also provide access to experts in the usage of these tools, who can either advise on or perform the analysis.
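The idea of a workflow, in which each tool hands compatible data to the next, can be sketched in a few lines. The example below is a toy illustration only: the three steps (quality filtering, trimming and GC-content calculation), their names and the read data are all invented, and they stand in for real tools rather than reproducing any particular pipeline.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A read is (name, sequence, per-base quality scores); a hypothetical toy format.
Read = Tuple[str, str, List[int]]


def filter_by_quality(reads: List[Read], min_mean_quality: float = 20) -> List[Read]:
    """Keep reads whose mean base quality reaches the threshold."""
    return [r for r in reads if r[2] and sum(r[2]) / len(r[2]) >= min_mean_quality]


def trim_to_length(reads: List[Read], length: int = 8) -> List[Read]:
    """Trim every read (and its quality scores) to a fixed length."""
    return [(name, seq[:length], qual[:length]) for name, seq, qual in reads]


def gc_content(reads: List[Read]) -> Dict[str, float]:
    """Final step: fraction of G and C bases per non-empty read."""
    return {
        name: sum(base in "GC" for base in seq) / len(seq)
        for name, seq, _ in reads
        if seq
    }


def run_pipeline(data, steps: Sequence[Callable]):
    """Pass the data through each step in order; each output feeds the next step."""
    for step in steps:
        data = step(data)
    return data


# Invented reads for illustration only.
reads = [
    ("read1", "ACGTGGCCAT", [30] * 10),
    ("read2", "TTTTAAAAGG", [10] * 10),  # low quality, removed by the first step
    ("read3", "GGGGCCCCAA", [25] * 10),
]

result = run_pipeline(reads, [filter_by_quality, trim_to_length, gc_content])
print(result)
```

Keeping each step as a function with a well-defined input and output makes it easy to swap one tool for another without rewriting the rest of the workflow, which is the same principle that dedicated workflow systems apply at larger scale.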
The two main databases are the NCBI Genome database and the Ensembl genome database. These databases contain genomes from all kingdoms of life, although most of the sequences are from model organisms. Thousands of organisms have been sequenced, as well as thousands of human genomes. There are also a number of databases for specific purposes, such as the cancer genome databases, which focus on the genomes of a variety of cancers.