Getting Started with Machine Learning

“Machine learning” is a mystical term. Most developers don’t need it at all in their daily work, and the only details about it we know are from some university course 5 years ago (which is already forgotten). I’m not a machine learning expert, but I happened to work in a company that does a bit of that, so I got to learn the basics. I never programmed actual machine learning tasks, but got a good overview.

But what is machine learning? It’s instructing the computer to make sense of big amounts of data (#bigdata hashtag – check). In what ways?

  • classifying a new entry into existing classes – is this email spam, is this news article about sport or politics, is this symbol the letter “a”, or “b”, or “c”, is this object in front of the self-driving car a pedestrian or a road sign.
  • predicting a value of a new entry (regression problems) – how much does my car cost, how much will the stock price be tomorrow.
  • grouping entries into classes that are not known in advance (clustering) – what are you market segments, what are the communities within a given social network (and many more applications)

How? With many different algorithms and data structures. Which are fortunately already written by computer scientists and developers can just reuse them (with a fair amount of understanding, of course).

But if the algorithms are already written, then it must be easy to use machine learning? No. Ironically, the hardest part of machine learning is the part where a human tells the machine what is important about the data. This process is called feature selection. What are those features, that describing the data in a way, that the computer can use them to identify meaningful patterns. I am no machine learning expert, but the way I see it, this step is what most machine learning engineers (or data scientists) are doing on a day-to-day basis. They aren’t inventing new algorithms; they are trying to figure out what combinations of features for a given data gives best results. And it’s a process with many “heuristics” that I have no experience with. (That’s an oversimplification, of course, as my colleagues were indeed doing research and proposing improvements to algorithms, but that’s the scientific aspect of things)

I’ll now limit myself only to classification problems and leave the rest. And when I say “best results”, how is that measured? There are the metrics of “precision” and “recall” (they are most easily used for classification into two groups, but there are ways to apply them to multi-class or multi-label classification). If you have to classify an email as spam or not spam, your precision is the percentage of the emails properly marked as spam from all the emails marked as spam. And the recall is the percentage of emails properly marked as spam from the total number of emails marked as spam. So if you have 200 emails, 100 of them are spam, and your program marks 80 of them as spam correctly and 20 incorrectly, you have a 80% precision (80/80+20) and 80% recall (80/100 actual spam emails). Good results are achieved when you score higher in these two metrics. I.e. your spam filter is good if it correctly detects most spam emails, and it also doesn’t mark non-spam emails as spam.

The process of feeding data into the algorithm is simple. You usually have two sets of data – the training set and the evaluation set. You normally start with one set and split it in two (the training set should be the larger one). These sets contain the values for all the features that you have identified for the data in question. You first “train” your classifier statistical model with the training set (if you want to know how training happens, read about the various algorithms), and then run the evaluation set to see how many items were correctly classified (the evaluation set has the right answer in it, so you compare that to what the classifier produced as output).

Let me illustrate that with my first actual machine learning code (with the big disclaimer, that the task is probably not well-suited for machine learning, as there is a very small data set). I am a member (and currently chair) of the problem committee (and jury) of the International Linguistics Olympiad. We construct linguistics problems, combine them into problem sets and assign them at the event each year. But we are still not good at assessing how hard a problem is for high-school students. Even though many of us were once competitors in such olympiads, we now know “too much” to be able to assess the difficulty. So I decided to apply machine learning to the problem.

As mentioned above, I had to start with selecting the right features. After a couple of iterations, I ended up using: the number of examples in a problem, the average length of an example, the number of assignments, the number of linguistic components to discover as part of the solution, and whether the problem data is scrambled or not. The complexity (easy, medium, hard) comes from the actual scores of competitors at the olympiad (average score of: 0-8 points = hard, 8-12 – medium, >12 easy). I am not sure whether these features are related to problem complexity, hence I experimented with adding and removing some. I put the feature data into a Weka arff file, which looks like this (attributes=features):

@RELATION problem-complexity

@ATTRIBUTE scrambled {true,false}
@ATTRIBUTE complexity {easy,medium,hard}


The evaluation set look exactly like that, but smaller (in my case, only 3 items so far, waiting for more entries).

Weka was recommended as a good tool (at least for starting), and it has a lot of algorithms included, which one can simply reuse.

Following the getting started guide, I produced the following simple code:

public static void main(String[] args) throws Exception {
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("problem_complexity_train_3.arff"));
    Instances trainingSet = loader.getDataSet();
    // this is the complexity, here we specify what are our classes, 
    // into which we want to classify the data
    int classIdx = 5;
    ArffLoader loader2 = new ArffLoader();
    loader2.setFile(new File("problem_complexity_test_3.arff"));
    Instances testSet = loader2.getDataSet();
     // using the LMT classification algorithm. Many more are available   
     Classifier classifier = new LMT();
     Evaluation eval = new Evaluation(trainingSet);
     eval.evaluateModel(classifier, testSet);  

     // Get the confusion matrix
     double[][] confusionMatrix = eval.confusionMatrix();

A comment about the choice of the algorithm – having insufficient knowledge, I just tried a few and selected the one that produced the best result.

After performing the evaluation, you can get the so called “confusion matrix”, (eval.toConfusionMatrix) which you can use to see the quality of the result. When you are satisfied with the results, you can proceed to classify new entries, that you don’t know the complexity of. To do that, you have to provide a data set, and the only difference to the other two is that you put question mark instead of the class (easy, medium, hard). E.g.:


Then you can run the classifier:

ArffLoader loader = new ArffLoader();
loader.setFile(new File("unclassified.arff"));
Instances dataSet = loader.getDataSet();

DecimalFormat df = new DecimalFormat("#.##");
for (Enumeration<Instance> en = dataSet.enumerateInstances(); en.hasMoreElements();) {
    double[] results = classifier.distributionForInstance(en.nextElement());
    for (double result : results) {
        System.out.print(df.format(result) + " ");

This will print the probabilities for each of your entries to fall into each of the classes. As we are going to use this output only as a hint towards the complexity, and won’t use it as a final decision, it is fine to yield wrong results sometimes. But in many machine learning problems there isn’t a human evaluation of the result, so getting higher accuracy is the most important task.

How does this approach scale, however. Can I reuse the code above for a high volume production system? On the web you normally do not run machine learning tasks in real time (you run them as scheduled tasks instead), so probably the answer is “yes”.

I still feel pretty novice in the field, but having done one actual task made me share my tiny experience and knowledge. Meanwhile I’m following the Stanford machine learning course on Coursera, which can give you way more details.

Can we, as developers, use machine learning in our work projects? If we have large amounts of data – yes. It’s not that hard to get started, and although probably we will be making stupid mistakes, it’s an interesting thing to explore and may bring value to the product we are building.