To get started with WEKA, you can download it from the WEKA homepage at http://www.cs.waikato.ac.nz/~ml/weka/.
Once downloaded, you can run it as a regular application and use the GUI. You can also start WEKA from the command line, in which case the GUI will still appear. This is useful, for example, if you need a lot of memory and want to start it from the command line on one of the nodes.
To run it from the command line with, for example, a 10 GB maximum heap (the -Xmx flag), type:
java -Xmx10g -jar weka.jar
Using SVM with WEKA
If you want to run classifiers like SVM, you need to do a bit of extra work. First, you need to download libsvm.jar from http://www.cs.iastate.edu/~yasser/wlsvm/.
Then, when running WEKA, don't use the command above; instead use
java -Xmx10g -classpath $CLASSPATH:weka.jar:libsvm.jar weka.gui.GUIChooser
Update on SVM
For classification on large data sets, libLINEAR has a much nicer SVM implementation than libSVM. You can find a Java version of liblinear here. The liblinear1.5.jar build is known to work, but liblinear-1.51-weights-with-deps.jar does not. -Mimi
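(By analogy with the libSVM setup above, you would then start WEKA with the liblinear jar on the classpath; the jar name below assumes the liblinear1.5.jar build mentioned above.)
java -Xmx10g -classpath $CLASSPATH:weka.jar:liblinear1.5.jar weka.gui.GUIChooser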
When you open WEKA you will be faced with a "GUI Chooser". I usually use the "WEKA Explorer".
From the Explorer home screen, load in a file with Open file... (WEKA uses the .arff file format). The Visualize tab allows you to visualize the data in lots of nice ways. The Classify tab allows you to apply classifiers: you can choose one of many built-in classifiers and run it on your data. If you select LibSVM here, this is where you would have needed the libsvm.jar file discussed above.
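In case the .arff format is new to you, here is a minimal made-up example: a header declaring the attributes, followed by comma-separated data rows (the attribute names and values are invented for illustration).
@relation pixels
@attribute x numeric
@attribute y numeric
@attribute class {black,white}
@data
12,47,black
30,5,white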
Note: all the classifications you run appear in the "Result list" at the lower-left of the screen. It seems that letting these results accumulate slows down the system, so it's best to clear them once in a while by right-clicking an entry and choosing "Delete result buffer".
In the "Preprocess" tab of the WEKA you can select filters to apply them to the data. One useful filter is "Resample", which lets you resample your data and possibly select a smaller subset. This filter is located in filters-->supervised-->instance-->Resample. Then, if you click on the parameters (or the word "Resample") you can change settings like "biasToUniformClass", which biases the resampling towards having the same number of examples for each class. This is very useful when using pixel training since the number of black pixels is generally much less than white. The parameters "sampleSizePercent" allows the user to resample to a smaller subset. This can be useful for debugging if you want to try things with small data.
Note: To get a small data set for initial tests, I tried exploiting the "Percentage split" option, using a very small fraction of the data (like 0.1%) for training and the rest for testing. This does not work: training takes impossibly long. This must be a WEKA bug, so use the Resample filter instead to get around it.
In the "Classify" tab of the explorer, you can weight the examples by selecting the classifier meta->CostSensitiveClassifier. This meta-classifier just runs some classifier with weighted examples. Enter the parameter screen for the meta-classifier and choose your classifier. Then, to change the weights edit the costMatrix. As an example, if you have two classes and want examples in the second class to be weighted 5 times more than examples in the first class, you want the costMatrix equal to
[0 5
 1 0]
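The same setup can be run from the command line, passing the cost matrix in MATLAB-style single-line format; a sketch, with a hypothetical train.arff and SMO picked arbitrarily as the base classifier:
java -classpath $CLASSPATH:weka.jar weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[0.0 5.0; 1.0 0.0]" -W weka.classifiers.functions.SMO -t train.arff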
The WEKA Experimenter allows you to create a queue of datasets and algorithms (and parameters) to be run and compared to each other. The Experimenter also has a set of built-in statistics that can be calculated and visualized via the "Analyze" section.
To get started, choose "Experimenter" from the GUIChooser. In the "Setup" tab, select the "New" option. This opens the control panel that allows you to define a new experiment. First you need to specify a results destination (usually an ARFF file) where the experiment and its results will be saved (in case you need to re-run or re-analyze it).
There is actually no way to explicitly specify that one ARFF file should be the training set and another the test set, but there is a way to abuse one of the options to achieve the same thing. You have 3 choices: (1) CrossValidation, where the entire dataset is split into n folds (you choose n) and the algorithm is run n times, each time with one fold held out for testing, giving an averaged error estimate; (2) PercentageSplit (randomized), which splits the dataset into the specified percentages, with each data point picked at random from the whole set; and (3) PercentageSplit (order preserved), which splits the dataset at the specified percentage while keeping the original order. Use option (3) if you want to specify a training set and a test set (presumably you know their respective sizes).
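For example, if you have, say, 8,000 training instances followed by 2,000 test instances (made-up numbers), setting the split percentage to 80% with order preserved makes exactly the training instances land in the training split and the test instances in the test split.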
In the "Iteration Control" panel you choose how many times to repeat your experiment and in which order to iterate. "Algorithms first" means that the first algorithm (in the algorithms list) is run on all the datasets provided, then the second algorithm on all the data, and so forth. "Data sets first" means the whole list of algorithms is run on the first dataset, then on the second, etc.
In the "Datasets" panel, input your data. If you have a training set and a test set, and you're using PercentageSplit (order preserved) as your "Experiment Type", make sure that your training set comes first.
In the "Algorithms" panel, specify the algorithms you want to compare. E.g., I wanted to compare what the soft-margin SVM in libLINEAR does when given different C (cost) parameter values, so I lined up 5 libLINEAR classifiers with different settings for C. The algorithms will be run in the order that you specify.
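As a sanity check outside the Experimenter, you can also evaluate a single C setting from the command line; a sketch, assuming the liblinear jar from above and hypothetical train.arff/test.arff files:
java -classpath $CLASSPATH:weka.jar:liblinear1.5.jar weka.classifiers.functions.LibLINEAR -C 1.0 -t train.arff -T test.arff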
Now you're ready to run your experiment. Go to the "Run" tab and click "Start".
When the run finishes, your experiment is automatically saved to the location you provided in the "Setup" tab, and it is also loaded as the source data in the "Analyze" tab (via the "Experiment" button). Here you can select various tests (e.g. area under ROC, percent correct, elapsed time, etc.) and save the results.