Scikit-learn is a powerful and popular package for machine learning. One feature that we had read about but never actually used is the Pipeline class. When you have raw data that needs preprocessing or feature extraction before a classifier can run on it, you can combine these preprocessing/feature extraction steps with the classifier into one pipeline object. This pipeline object behaves as if it were a classifier itself and runs all the necessary steps.
To play with the pipeline feature we wanted some image data to classify, so that we could combine OpenCV functions with an sklearn classifier in such a pipeline.
The data we use is Kaggle's apple-bananas-oranges dataset. It consists of images of both fresh and spoiled apples, bananas and oranges. To make things simpler we only used the fresh images, which gave us three categories. And to avoid making things too simple we converted the images to grayscale (otherwise you could classify by simply checking for red, yellow and orange).
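As a minimal sketch, loading the images might look like the following, assuming one folder of fresh images per fruit (the folder names and the helper function are hypothetical, not from the original code):

import os
import cv2

def load_category(folder, label):
    # Load every image in `folder` as grayscale and pair it with `label`.
    images, labels = [], []
    for name in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, name), cv2.IMREAD_GRAYSCALE)
        if img is not None:
            images.append(img)
            labels.append(label)
    return images, labels

# Hypothetical directory layout: one folder per category.
apples, apple_labels = load_category('freshapples', 0)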
An old but still interesting algorithm for extracting features from images is SIFT, the Scale-Invariant Feature Transform. We are using the OpenCV implementation of SIFT. At the end of our pipeline there is a database waiting to store our results. Here's a sketch of our pipeline: grayscale image → SIFT feature extraction → SVM classifier → Redis database.
Sklearn has a well-known API where you pass in training data as an array in which every row is a data point and every column is a feature. Our SIFT features, however, consist of several key points, and the number of relevant key points usually differs from image to image. Therefore, in order to stack all training data points into one array, we need to fix the number of SIFT key points, even for images where more key points would be available. We are aware that this throws away information that could be valuable for classification, but since the goal of our project is to combine OpenCV functions with sklearn classifiers in a pipeline, we accept it.
The key to sklearn's pipeline feature is that every element of the pipeline has a transform and a fit method. Sklearn will even raise an error if you try to build a pipeline with elements that do not implement these methods! When the pipeline runs, however, only the last element acts as the estimator and uses its fit/predict methods; all the previous elements simply take their input and pass on the transformed output. Here is how we turned OpenCV's SIFT computation into an sklearn pipeline element:
import cv2
import numpy as np

class Sift(object):
    def __init__(self, k):
        self.nofp = k  # fixed number of key points to keep per image
        # Newer OpenCV builds expose this as cv2.SIFT_create()
        self.sift = cv2.xfeatures2d.SIFT_create()

    def fit(self, *args, **kwargs):
        return self  # nothing to learn, but sklearn expects fit to return self

    def transform(self, X, **kwargs):
        # For each image keep the first nofp descriptors, flatten them
        # into one feature row, then stack all rows into one array.
        rows = []
        for img in X:
            kp, des = self.sift.detectAndCompute(img, None)
            assert des is not None and des.shape[0] >= self.nofp
            rows.append(des[:self.nofp, :].flatten())
        return np.vstack(rows)
Now that we have a Sift class with a transform method, we can use sklearn's support vector machine and pipeline implementations to create our pipeline:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
pipe = Pipeline([('sift', Sift(10)), ('svm', SVC(gamma='auto'))])
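As a minimal usage sketch, the pipeline is then trained and applied like any other sklearn classifier; the variable names here are hypothetical stand-ins for the image lists and labels loaded earlier:

# train_images/test_images: lists of grayscale images, labels: integers per fruit
pipe.fit(train_images, train_labels)
predictions = pipe.predict(test_images)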
Finally, when running the pipeline on test data, we added one last thing to our project. We thought that in a real-world setting you usually don't just want a trained pipeline; you also want to run that pipeline on actual data, and the results actually matter. To simulate that we stored the results of our test runs in a database. We used the popular Redis key-value store, because it is very easy to set up and it even has its own pipeline feature! After every prediction we tell Redis to store the result, and the Redis pipeline works as a buffer that collects all our commands and executes them once the test run is over.
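A minimal sketch of that buffering with the redis-py client, assuming a local Redis server and a hypothetical key scheme of one key per test image:

import redis

r = redis.Redis()   # assumes a Redis server on localhost:6379
buf = r.pipeline()  # Redis's pipeline buffers commands client-side
for name, pred in zip(test_names, predictions):  # test_names is hypothetical
    buf.set(name, int(pred))  # store one prediction per test image
buf.execute()  # send all buffered SET commands in one round trip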
Even though the focus of this project was not to solve the Kaggle task but to play with the pipeline features, we did evaluate our pipeline on some test data and looked at how the accuracy improved as the number of training samples per category increased. When we downloaded it, the Kaggle dataset had around 200 images per category; we used up to 100 images per category for training. Here are the results we observed:
It is notable that with just 10 samples per category the trained pipeline is actually worse than guessing. We stopped increasing the training data when the rate of improvement began to slow down. The code is on our GitHub.
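For reference, the evaluation loop might look roughly like this; take_n_per_category is a hypothetical helper that draws n training images per fruit, and the sample sizes are illustrative:

from sklearn.metrics import accuracy_score

for n in (10, 25, 50, 75, 100):  # training samples per category
    train_images, train_labels = take_n_per_category(images, labels, n)
    pipe.fit(train_images, train_labels)
    acc = accuracy_score(test_labels, pipe.predict(test_images))
    print(n, acc)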
