Scikit-learn, the Iris Dataset, and Machine Learning: The Journey to a New Skill
I’m already familiar and reasonably capable with Python, as you probably know if you read my last blog post, so now it’s time to start my machine learning journey with a special Python library. Scikit-learn is a set of Python modules for machine learning and data mining. I’ve been using Caleb Stultz’s Machine Learning for Apps lessons to learn more about ML. Scikit-learn includes built-in data sets you can use to practice machine learning, and per Caleb’s lesson I started with the iris data set.
The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor) with four features measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other, which makes it useful for machine learning.
With machine learning, there are essentially five major steps: get the data, prepare the data, train the model, test the model, and improve it. Since we’re using the iris data set from scikit-learn, getting the data is already done, so we really only need to prepare it. To prepare your data, you must distinguish the features from the labels. The features are what the model uses to determine the label of whatever you’re trying to teach your “machine” to recognize; in the iris example, the features are the lengths and widths of the sepals and petals, while the labels are the three species of iris. Once I imported the data set from scikit-learn, I separated it into features and labels. Scikit-learn’s model selection module has a function that splits your data into training and test sets. The function is conveniently named train_test_split. You can create variables for the training and test data and set them equal to the result of train_test_split with a few arguments like below:
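Here’s a rough sketch of what that looks like. The variable names and the random_state value are my own choices (random_state just makes the split reproducible); the test_size argument is what sets aside half the data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data set and separate it into features (X) and labels (y)
iris = load_iris()
X = iris.data    # sepal/petal lengths and widths, in centimeters
y = iris.target  # species: 0 = setosa, 1 = versicolor, 2 = virginica

# Split the data: test_size=0.5 puts half into the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

print(X_train.shape, X_test.shape)  # 75 training and 75 test samples, 4 features each
```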
I’m setting the test size to 50%, so half the data is used for testing and the other half for training. With the data properly prepped, I can use a classifier to determine the species of the samples set aside for testing. We used KNeighborsClassifier, from scikit-learn’s neighbors module, which plots the data and compares each data point to the labels of its nearest neighbors.
The label that appears most often among a point’s nearest neighbors determines its predicted label. Before that, we must call the classifier’s fit method on our training data. Once the model is fit, you can start testing it by calling its predict method on the test features.
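Assuming the same split as before, the fit-then-predict step looks roughly like this (again, the variable names and random_state are mine):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42
)

# KNeighborsClassifier defaults to looking at the 5 nearest neighbors
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)          # fit the model to the training data
predictions = clf.predict(X_test)  # predict a label for each test sample
```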
With all of that, you might be wondering: how correct are your predictions? Well, there’s another scikit-learn function for that in the metrics module, called accuracy_score. It compares one array to another, so you can compare the array of predicted labels to the actual test labels and get the accuracy score. If you have an accuracy score of 90%+, you have a solid machine learning model on your hands.
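Continuing the same sketch, scoring the predictions is a one-liner (the exact score will depend on how the data happened to be split):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42
)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Fraction of test samples whose predicted label matches the true label
score = accuracy_score(y_test, predictions)
print(score)
```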
You can now feed in new data, as long as it’s within the range of the original measurements, to predict which type of iris plant you have, or whatever label your machine learning model uses. I know it’s not the fanciest use of machine learning, but predicting a plant species from four features/characteristics is still cool to me.
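As a final sketch, here’s what predicting on a new measurement might look like. The four numbers are just plausible sepal/petal values I made up (sepal length, sepal width, petal length, petal width, in centimeters):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42
)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# A single made-up flower: sepal length/width, petal length/width in cm
sample = [[5.1, 3.5, 1.4, 0.2]]
species = iris.target_names[clf.predict(sample)[0]]
print(species)
```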