MSc by Research in Computer Science
Bioinformatics: Biological Sequence Analysis Strand
The course consists of three parts:
- supervisions on sequence analysis, machine learning, background
  molecular biology, and the types of data available
- assessed coursework
- a research project
It is expected that the supervisions will be tailored to suit the
needs and interests of the student. It is envisaged that only a few
students will follow this course at any time, so that all teaching will be by
individual or small-group supervision.
There is a large and increasing amount of biological sequence data
available.
The analysis of biological sequences is a topic that involves both
some biological knowledge and some advanced computational
techniques. It is therefore not generally taught at undergraduate
level.
The course is intended either for students with biological knowledge
and an aptitude for computation, or for students with computational
knowledge and an interest in molecular biology.
The course aims to give the student a sufficient knowledge of current
problems and methods in sequence analysis to be able to choose and
deliver a good research project in the area.
At the end of the course, the student should have an understanding of
- the nature and origin of biological sequence data
- current techniques for modelling, searching, and annotating this data
- machine learning techniques relevant to sequence analysis.
The student should be able to plan and carry out a significant
research project on biological sequence data using the techniques learned.
- Background in Molecular Biology.
  - DNA, RNA, proteins, genetic code, transcription, translation, RNA
    editing. Structure of genes.
  - Gene expression. Promoters. Examples of genetic regulation and
    genetic cascades.
  - Structure and evolution of the genome.
  - Brief overview of experimental techniques of molecular biology.
- Modelling, analysis, searching, and alignment of biological
  sequences.
  Topics covered will include: criteria and methods for sequence
  alignment; hidden Markov models for sequence alignment and
  characterisation of sequence families; phylogenetic trees; RNA
  structure analysis and alignment using context-free grammars.
  This section of the course will be based on the book by Durbin et
  al. cited below, which is a comprehensive recent tutorial text.
- Techniques of machine learning.
  - Introduction to neural networks.
  - Introduction to support vector machines and other maximal margin
    methods.
  - Kernels for sequence comparison.
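As a rough illustration of what a kernel for sequence comparison can look like, the sketch below computes a simple k-mer counting kernel (often called a spectrum kernel): each sequence is mapped to the counts of its length-k substrings, and the kernel value is the inner product of the two count vectors. This is only one possible kernel and not necessarily one covered in the supervisions; the function names and the default k=3 are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of sequences s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * ct[kmer] for kmer, n in cs.items())


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    print(spectrum_kernel("ACGTACGT", "ACGTTTGT", k=3))
```

Because the value is an inner product of explicit feature vectors, a function of this kind can be used wherever a positive semi-definite kernel is required, for example with a support vector machine.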
In addition to question sheets accompanying the supervisions,
students will be required to complete two substantial pieces of
assessed coursework.
- Search of public databases for sequences related to given protein
  sequences. The search should cover both protein databases (for related
  proteins) and DNA sequence databases (for related pseudogenes). The
  write-up should give an account both of the results obtained and of the
  computational matching techniques used in the searches.
- Given a set of protein sequences, construct an alignment of the
  sequences, and then use the Baum-Welch algorithm to train an HMM that
  characterises the alignment.
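To give a flavour of the dynamic programming that underlies sequence alignment, here is a minimal sketch of pairwise global alignment scoring (the Needleman-Wunsch recurrence) with a linear gap penalty. It is an illustration only, not the coursework specification: the function name and the match, mismatch, and gap values are assumptions, and real protein alignments would normally use a substitution matrix. Multiple alignment and Baum-Welch training are not shown.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of a and b (linear gap penalty)."""
    # prev[j] is the best score for aligning the current prefix of a with b[:j].
    # Initially the prefix of a is empty, so b[:j] is aligned entirely to gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Pair a[i-1] with b[j-1], either as a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or align one of the two residues against a gap.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    print(global_alignment_score("ACGT", "ACT"))
```

Only two rows of the score table are kept, which is enough to obtain the score; recovering the alignment itself would require storing the full table (or traceback pointers).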
We expect most projects to apply machine learning techniques to
biological sequences. Students will draw on the Department's machine
learning expertise to find methods for tackling problems in the
analysis of biosequences.
The following are examples of current research areas within which a capable
student could conduct a research project:
- Identification of Protein Fold Types
- There is now a consensus
that the three-dimensional (tertiary) structures of most proteins can be classified into a relatively small number of
fold types. It is of interest to predict the fold type of a protein
from its amino-acid sequence because the number of proteins with
known sequences is far larger than the number of proteins with known
three-dimensional structures. Machine learning techniques can be used
to find classification rules for predicting the fold type from the
sequence.
- Identification of Sequence Similarities in Co-Regulated Genes
- It is now possible to measure the levels of expression (the rate at
which a gene is being used) of very many genes simultaneously. There
are groups of genes, at different places in the genome, which tend to
be expressed together, and may be regulated by the same transcription
factor, which binds to a promoter sequence "upstream" of each gene. A possible
project would be to search for common promoter sequences among genes
with correlated patterns of expression.
- Identification of Exon-Intron Boundaries in Eukaryotic Genes
- The coding regions of eukaryotic genes (broadly, non-bacterial genes)
are not continuous, but are interrupted by non-coding stretches of DNA
(introns), whose transcripts are
excised from the messenger RNA before translation to protein takes
place. Finding exactly where each intron starts and ends
is therefore crucial to correct identification and analysis of a
gene. Machine learning techniques have been used to identify intron
boundaries in real sequences, but there is still some way to go before
the technique is entirely accurate.
Victor Solovyev, Chris Watkins, Hugh Shanahan
- R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence
  Analysis, Cambridge University Press, 1998.
- P. Baldi and S. Brunak, Bioinformatics: The Machine Learning
  Approach, MIT Press, 1998.