SUBJECT

Title

Bioinformatics I-II

Type of instruction

lecture + practical

Level

master

Faculty

Faculty of Science

Part of degree program

Biology MSc

Credits

Recommended in

Semester 1

Typically offered in

Autumn semester

Course description

Lectures

Introduction: The history of bioinformatics. The subject matter and application of bioinformatics. Genome projects. Trends. The overview of the most frequent bioinformatics methods, tools, programme packages in molecular biology. Internet basics: e-mail, telnet, ssh, ftp, www. Bioinformatics on the web. EMBnet, EBI, NCBI
Bioinformatics databases: Molecular biology databases. Primary sequence databases. Nucleotide sequence databases: EMBL, GenBank, DDBJ. Protein sequence databases: PIR, SWISS-PROT, TrEMBL. Complex ("non-redundant") protein sequence databases. Sequence database formats. Secondary, derivative protein sequence databases. Complex databases: mapping-genomics databases (genome projects), taxonomy, phylogenetic databases (NCBI/Taxonomy, COG), functional database (Gene Ontology), human genetics database (OMIM), bibliographic database: PubMed
Information retrieval from databases: Search in database annotations: SRS. Integrated information retrieval: NCBI-Entrez. Integrated search in genome projects. Genome browsers: Ensembl, UCSC.
The basics of biological sequence analysis. Handling and checking sequencing data. Contig assembly. Finding restriction sites. Primer design. Depositing new sequences into primary databases.
Sequence comparisons: Pairwise comparisons: dot-plot. Scoring systems, substitution matrices. PAM, BLOSUM matrices. Pairwise sequence alignments: optimal alignment. Global and local alignment; dynamic programming algorithms: Needleman-Wunsch and Smith-Waterman algorithms. Gap penalties.
Similarity searches in sequence databases: Search using optimal alignment algorithms: web implementations. Heuristic search methods: FASTA and BLAST algorithms. Basic statistics: estimating the significance of a hit. Programs of the FASTA3 package. BLAST programmes. Deciding which programme to use. Maximizing signal to noise ratio. Filtering false positive hits: avoiding low complexity regions, repetitive sequences, vector contaminations.
Multiple sequence alignment methods: Progressive alignment, the ClustalW programme. Segment based alignment: Dialign2. Motif-based alignment: MEME. Visualization of multiple alignments.
Molecular phylogenetic analyses I: Overview of phylogenetic analyses: phylogenetic signal, phylogenetic trees. Estimating evolutionary changes, distances: substitution models (amino acid, nucleotide). Phylogenetic reconstruction: UPGMA, least squares, minimum evolution, neighbour joining. Distance based methods.
Molecular phylogenetic analyses II. Character based methods. Maximum parsimony (MP) methods. Consensus tree. Maximum likelihood (ML) methods. Heuristic search methods for MP and ML trees: "branch-and-bound", NNI, SPR, TBR, SD. Statistical tests. Comparison of trees, topological distances. The PHYLIP and PAUP programmes.
Predictions using nucleotide sequences. Detection of functional sites, regions in the DNA. Prediction of coding regions, exon-intron boundaries. RNA secondary structure prediction.
Detection of distant protein sequence similarities. Protein family, domain, functional site searchable databases. Functional prediction. Regular expression, pattern databases: PROSITE patterns. Motif databases: PRINTS, BLOCKS. Position specific scoring matrices (PSSM), profile methods. Iterative searches: PSI-BLAST, PHI-BLAST. Profile and profile HMM databases, PROSITE profile, Pfam, SMART. Clustering databases: ProDom. Integrated databases, search systems: InterPro, DART.
Protein structure, structure prediction I. The levels of protein structures. Protein geometry. Protein families, superfamilies. Folding: sequence coded information, hidden information, fold families. Structural classification. Protein structure databases: PDB, MMDB, SCOP, CATH. Protein structure comparisons: PRIDE, genetic algorithms. Structural similarity searches, alignment based on structure. Visualization of 3D structure data. Structure representations. The most frequently used visualization packages.
Protein structure prediction II. Modelling: Modelling in practice. Statistical methods: Chou-Fasman prediction, secondary structure prediction, neural-network based systems, motif and domain predictions, identification of low complexity regions, search for transmembrane regions. Homology modelling. Molecular mechanics/dynamics methods. Ab initio methods. Reliability of models. Applications.
Gene expression analyses: Proteomics, EST projects, EST clustering. Cluster analysis of DNA chip data. Protein identification: 2D electrophoresis, evaluation of mass spectrometry data. Protein interaction maps.
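As an illustration of the dynamic programming topic listed under sequence comparisons, the following is a minimal Python sketch of Needleman-Wunsch global alignment. The function name, the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity; they are not course material, and real analyses would use a substitution matrix such as PAM or BLOSUM and typically affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative
    assumptions, not recommended parameters.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The Smith-Waterman local alignment named in the same lecture differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell.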

Practicals

The use of the UNIX/Linux operating system. Remote access: ssh, telnet. File handling, file transfer between computers: scp, sftp, ftp, email. Program execution: redirection of program input and output. Computer networking basics.
Bioinformatics databases. Web access to the three primary databases. Comparison of field structures. Protein databases, comparisons. The most significant domain databases. Examples of complex databases.
Information retrieval from databases I: The NCBI Entrez system. Search in the PubMed bibliographic database. Link out from the PubMed database to other Entrez databases. Search in other Entrez databases.
Information retrieval from databases II: Demonstration of the SRS search and retrieval system. Simple queries in SRS. Complex queries in SRS. Using external programs from SRS. Sequence retrieval with SRS.
Basics of computer sequence analyses. Sequence handling, sequence formats. Demonstration of sequence analysis programmes, programme packages. Sequence handling in the EMBOSS programme package.
Computing tasks related to sequencing. Computerized primer design. Restriction site search. Analysis of data from automated sequencers. Removal of contaminating sequences. Sequence assembly. Sequence submission through the web into a primary database.
The use of Genome databanks. Whole genome sequences. The Ensembl genome browser. Queries in complete genomes. Genome region comparisons.
Sequence comparisons, sequence alignments. Dot-plot methods. Global and local alignments (DNA-DNA, protein-protein). Determination of exon-intron boundaries by aligning cDNA/genomic DNA and protein/genomic DNA. Programs in the EMBOSS package and online resources.
Similarity searches in sequence databases. Parametrising and use of FASTA3 and BLAST. Formatting a local BLAST database. Scoring matrices, gap penalties. Masking low complexity and repeated sequences. Queries in local and remote databases. Analysis of the results.
Multiple sequence alignment. Protein domain structure prediction. Demonstration of the available programmes, applicability and comparisons. Nucleic acid and protein sequence alignments. Search for conserved regions in the aligned sequences. Protein domain, pattern and motif databases. Known domain and motif search in proteins. Profile HMM construction from aligned domain sequences. Search for proteins containing a given motif from the SWISS-PROT database. PSI-BLAST searches.
Molecular phylogenetic analyses. Demonstration of the PHYLIP and PAUP packages. Demonstration of other packages. Phylogenetic reconstruction with different distance based and character based (MP, ML) methods. Bootstrap analysis.
Protein structure: The PDB database, structure search, structure retrieval. Visualisation of structural data: ICMLite, SwissPDBViewer, Rasmol. Highlighting amino acids, regions, side chains.
Protein structure prediction: Structure prediction with homology modelling. Model building of an unknown sequence. Evaluation of the model (Ramachandran plot). Structure prediction without sequence homology. Prediction of transmembrane region (DAS algorithm), low complexity regions (SEG), domain prediction (SBASE), secondary structure prediction (PHD). Sequence and structure alignment. Fold recognition (HOMSTRAD).
Microarray data analysis: sample expression data cluster analysis. SNP CHIP analysis. Statistical methods. Microarray data visualization.
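The restriction site search listed among the practicals can be sketched in a few lines of Python. This is only an illustrative example: the function name and the test sequence are hypothetical, the search covers a single strand and an exact recognition sequence, and the practicals themselves use dedicated tools such as those in the EMBOSS package.

```python
def find_sites(seq, site):
    """Return 1-based start positions of every occurrence of `site` in `seq`.

    Matching is case-insensitive and overlapping occurrences are reported.
    Only the given strand is searched; degenerate (IUPAC) recognition
    sequences are not handled in this sketch.
    """
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)            # convert to 1-based coordinate
        start = seq.find(site, start + 1)      # step by one to allow overlaps
    return positions


if __name__ == "__main__":
    # Hypothetical sequence containing two EcoRI recognition sites (GAATTC).
    print(find_sites("ttGAATTCaaGAATTCg", "GAATTC"))
```

Because the EcoRI site is palindromic, a single-strand scan finds all of its occurrences; for non-palindromic sites the reverse complement would need to be searched as well.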

Readings

D. W. Mount: Bioinformatics. Sequence and genome analysis, Cold Spring Harbor Laboratory Press, ISBN 978-087969712-9, 2004
T. K. Attwood, D. Parry-Smith: Introduction to bioinformatics; Longman, 2001, ISBN 978-0582327887
J. Felsenstein: Inferring phylogenies; Sinauer Associates (2003), ISBN-13: 978-0878931774
A. Leach: Molecular modelling Principles and applications, Prentice Hall (2001) ISBN 9780582382107
Agricultural Biotechnology Institute