\documentclass[final]{beatcs}
\begin{document}
\beatcsPhD{%Author:
José Luis Triviño-Rodriguez}
{%Title: (may contain \\)
A new learning model of sequences of symbols}
{%Language:
Spanish}
{%Supervisor: (may contain \\)
Rafael Morales-Bueno}
{%Institute:
University of Málaga, Spain}
{%Date:
8 May 2002}

Many learning models have been developed in order to represent the evolution of a system. An important kind of these models represents the state of the system at every instant by a symbol, so that the evolution of the system is represented by a sequence of symbols. The \emph{Markov chain} is a widely used model that expresses the evolution of a system in terms of sequences of symbols, and many models have been derived from it. These models represent the behaviour of the system by means of the probability distribution of every symbol given the previous symbols in the sequence.

Markov chains and their derived models have been applied to model data sequences that exhibit the \emph{short memory} statistical property: if we consider the (empirical) probability distribution of the next symbol given the preceding subsequence of some given length, then there exists a length $L$ (the \emph{memory length}) such that this conditional distribution does not change substantially if we condition on preceding subsequences of length greater than $L$. This property can be found in many applications related to natural language processing, such as speech recognition and part-of-speech tagging.

This doctoral dissertation describes a multiattribute variable memory length Markov chain model. This model is called MPSA (Multiattribute Probabilistic Suffix Automata) and has been developed as a generalization of the PSA model (Probabilistic Suffix Automata) described by Dana Ron; thus, Ron's model can be seen as an MPSA with only one attribute. An MPSA is hard to learn.
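The short-memory property can be illustrated with a small sketch (plain Python, not part of the thesis; the function name is illustrative). It estimates the empirical next-symbol distribution conditioned on a preceding context; for a sequence whose memory length is $1$, lengthening the context beyond one symbol leaves the distribution unchanged.

```python
from collections import Counter

def next_symbol_dist(seq, context):
    """Empirical distribution of the next symbol given a preceding context."""
    counts = Counter()
    k = len(context)
    for i in range(k, len(seq)):
        if tuple(seq[i - k:i]) == context:
            counts[seq[i]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}

# Toy sequence with memory length 1: 'a' is always followed by 'b'.
seq = list("abababab")
print(next_symbol_dist(seq, ('a',)))       # conditioning on a length-1 context
print(next_symbol_dist(seq, ('b', 'a')))   # a longer context changes nothing
```

Both calls yield the same distribution, which is exactly what the short-memory property asserts for contexts longer than the memory length $L$.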
Hence, only the learning of the probability distribution defined by one attribute has been developed, and the hypothesis model used by the learning algorithm is the MPSG model (Multiattribute Prediction Suffix Graph) instead of the MPSA model. Moreover, it has been proved that an MPSG can be learned from a polynomial-size sample generated by an MPSA. The main feature of MPSGs is that they combine the learning of sequences of data, as Markov chains do, with learning based on several attributes, as in inductive learning. Hence, in order to learn an MPSG two tasks must be accomplished: computing the next-symbol probability distribution of the target attribute, and computing the information (memory length) needed from the rest of the attributes in order to compute this distribution.

In order to show the practical applicability of the MPSG model, two applications are described in this thesis: part-of-speech tagging, and music prediction and generation. On the one hand, a Spanish part-of-speech tagger (POS tagger) has been developed using the MPSG model. This tagger has been trained with the LexEsp corpus of the UPC, and its tagging accuracy is similar to that of the best Spanish taggers. On the other hand, a model of Bach's chorales has been computed using the MPSG model, and several new pieces of music have been generated from it. It has been shown that these pieces follow the same probability distribution as Bach's chorales. Moreover, we performed an auditory test in which subjects guessed whether the fragments they heard came from an original chorale or from the MPSG simulation. Fifty-two listeners evaluated the MPSG-generated music with this test; they correctly classified the fragments about 55\% of the time.
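The prediction-and-generation idea can be sketched for the single-attribute case (a minimal sketch in plain Python; the function names are illustrative and this is not the MPSG learning algorithm, only a variable-memory suffix model in the same spirit). Learning counts the next symbol after every suffix up to a maximum memory length; generation always consults the longest suffix that has been observed.

```python
from collections import Counter, defaultdict
import random

def learn_suffix_counts(seq, max_len):
    """Count next-symbol occurrences after every suffix of length 0..max_len
    (the candidate memory lengths of a variable-memory model)."""
    counts = defaultdict(Counter)
    for i in range(len(seq)):
        for k in range(min(i, max_len) + 1):
            counts[tuple(seq[i - k:i])][seq[i]] += 1
    return counts

def generate(counts, max_len, length, rng):
    """Generate a sequence, at each step sampling from the distribution
    conditioned on the longest observed suffix of the output so far."""
    out = []
    for _ in range(length):
        for k in range(min(len(out), max_len), -1, -1):
            ctx = tuple(out[len(out) - k:])
            dist = counts.get(ctx)
            if dist:
                symbols = list(dist)
                weights = [dist[s] for s in symbols]
                out.append(rng.choices(symbols, weights)[0])
                break
    return out

counts = learn_suffix_counts(list("abcabcabc"), max_len=2)
print("".join(generate(counts, 2, 12, random.Random(0))))
```

On this toy training sequence the model learns the deterministic cycle a→b→c→a, so every generated sequence follows that cycle after its first symbol; the thesis applies the same principle, with multiple attributes, to chorale generation.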
\begin{center}
\textbf{Table of Contents}
\end{center}
\begin{itemize}
\item[\textbf{1}] \textbf{Introduction \dotfill 1}
\item[\textbf{2}] \textbf{Time modeling in AI \dotfill 11}
  \begin{itemize}
  \item[2.1] Introduction \dotfill 11
  \item[2.2] Time ontology \dotfill 13
  \end{itemize}
\item[\textbf{3}] \textbf{The time dimension in ML \dotfill 17}
  \begin{itemize}
  \item[3.1] Introduction \dotfill 17
  \item[3.2] Behaviour pattern learning \dotfill 19
  \item[3.3] Learning models of sequences of symbols \dotfill 22
  \item[3.4] The time dimension in other models of ML \dotfill 70
  \end{itemize}
\item[\textbf{4}] \textbf{MPSA and MPSG models \dotfill 75}
  \begin{itemize}
  \item[4.1] Introduction \dotfill 75
  \item[4.2] MPSA \dotfill 79
  \item[4.3] Multiattribute Prediction Suffix Graph (MPSG) \dotfill 85
  \item[4.4] MPSG vs Decision Trees \dotfill 96
  \end{itemize}
\item[\textbf{5}] \textbf{MPSG learning algorithm \dotfill 103}
  \begin{itemize}
  \item[5.1] Implementation \dotfill 103
  \item[5.2] Analysis of the learning algorithm \dotfill 107
  \end{itemize}
\item[\textbf{6}] \textbf{Applications of the MPSG model \dotfill 115}
  \begin{itemize}
  \item[6.1] Part of speech tagging \dotfill 115
  \item[6.2] Using MPSG to predict or generate music \dotfill 125
  \end{itemize}
\item[\textbf{7}] \textbf{Conclusions and future work \dotfill 143}
\item[\textbf{A}] \textbf{MPSG learning subalgorithms \dotfill 165}
\item[\textbf{B}] \textbf{Algorithm for sequences generation \dotfill 175}
\item[\textbf{C}] \textbf{Doménico Scarlatti \dotfill 179}
\end{itemize}
\vspace{1cm}
\begin{tabular}{r l}
\textbf{Author's correspondence address} & José Luis Triviño-Rodriguez \\
 & E.T.S. Ingeniería Informática \\
 & Boulevard de Louis Pasteur, 57 \\
 & Campus Teatinos \\
 & 29071 - Málaga \\
 & Spain
\end{tabular}
\end{document}