Machine Learning Algorithms in Java (WEKA)

[ Pobierz całość w formacie PDF ]

core package. The constructor of Instances used by MessageClassifier()
takes three arguments: the dataset s name, a FastVector containing the
attributes, and an integer indicating the dataset s initial capacity. We set the
initial capacity to 100; it is expanded automatically if more instances are
added to the dataset. After constructing the dataset, MessageClassifier()
sets the index of the class attribute to be the index of the last attribute.
UpdateModel()
Now that you know how to create an empty dataset, consider how the
MessageClassifier object actually incorporates a new training message.
The method updateModel() does this job. It first calls cleanupString() to
delete all nonletters and non-whitespace characters from the message. Then
it converts the message into a training instance by calling makeInstance().
The latter method counts the number of times each of the keywords in
m_Keywords appears in the message, and stores the result in an object of the
class Instance from the core package. The constructor of Instance used in
makeInstance() sets all the instance s values to be missing, and its weight to
1. Therefore makeInstance() must set all attribute values other than the
class to 0 before it starts to calculate keyword frequencies.
Once the message has been processed, makeInstance() gives the newly
created instance access to the data s attribute information by passing it a
reference to the dataset. In Weka, an Instance object does not store the
type of each attribute explicitly; instead it stores a reference to a dataset
with the corresponding attribute information.
3 0 6 CHAPTER EIGHT | MACHINE LEARNING ALGORITHMS IN JAVA
Returning to updateModel(), once the new instance has been returned
from makeInstance() its class value is set and it is added to the training
data. In the next step a filter is applied to this data. In our application the
DiscretizeFilter is used to discretize all numeric attributes. Because a
class index has been set for the dataset, the filter automatically uses
supervised discretization (otherwise equal-width discretization would be
used). Before the data can be transformed, we must first inform the filter of
its format. This is done by passing it a reference to the corresponding
input dataset via inputFormat(). Every time this method is called, the filter
is initialized that is, all its internal settings are reset. In the next step, the
data is transformed by useFilter(). This generic method from the Filter
class applies a given filter to a given dataset. In this case, because the
DiscretizeFilter has just been initialized, it first computes discretization
intervals from the training dataset, then uses these intervals to discretize it.
After returning from useFilter(), all the filter s internal settings are fixed
until it is initialized by another call of inputFormat(). This makes it
possible to filter a test instance without updating the filter s internal
settings.
In the last step, updateModel() rebuilds the classifier in our program, an
instance-based IBk classifier by passing the training data to its
buildClassifier() method. It is a convention in Weka that the
buildClassifier() method completely initializes the model s internal
settings before generating a new classifier.
ClassifyMessage()
Now we consider how MessageClassifier processes a test message a
message for which the class label is unknown. In classifyMessage(), our
program first checks that a classifier has been constructed by seeing if any
training instances are available. It then uses the methods described
above cleanupString() and makeInstance() to transform the message
into a test instance. Because the classifier has been built from filtered
training data, the test instance must also be processed by the filter before it
can be classified. This is very easy: the input() method enters the instance
into the filter object, and the transformed instance is obtained by calling
output(). Then a prediction is produced by passing the instance to the
classifier s classifyInstance() method. As you can see, the prediction is
coded as a double value. This allows Weka s evaluation module to treat
models for categorical and numeric prediction similarly. In the case of
categorical prediction, as in this example, the double variable holds the
index of the predicted class value. In order to output the string
corresponding to this class value, the program calls the value() method of
the dataset s class attribute.
8.5 WRITING NEW LEARNING SCHEMES 3 0 7
8.5 Writing new learning schemes
Suppose you need to implement a special-purpose learning algorithm that [ Pobierz całość w formacie PDF ]

Archiwum