The Iris dataset is a popular dataset in machine learning and statistics. It contains 150 observations of iris flowers, with 50 observations for each of three species: setosa, versicolor, and virginica. Each observation includes four features: sepal length, sepal width, petal length, and petal width. In this article, we will explore the dataset and its applications.

Background

The Iris dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". Fisher used the dataset to demonstrate linear discriminant analysis, a method for finding the linear combination of features that best separates the classes. The dataset has since been widely used in machine learning and statistics, and it is often used as a benchmark for classification algorithms.
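
As a quick, hedged illustration of the method Fisher introduced (this snippet is an added sketch, not part of the original paper, and it assumes scikit-learn is available), linear discriminant analysis can be fitted to the same measurements in a few lines:

# A minimal sketch of linear discriminant analysis on the Iris data.
# Added for illustration; assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data = load_iris()
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(data.data, data.target)  # project onto the two discriminant axes
print(lda.score(data.data, data.target))           # mean accuracy on the training data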

Data Exploration

Let's start by loading the dataset and exploring its structure.

import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')
print(iris.head())

Output:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

The dataset has 150 rows and 5 columns. The first four columns are the features, and the last column is the species. Let's check the distribution of the features and the classes.
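
Before plotting, a small sanity check (an added snippet that assumes the seaborn column names shown above) confirms the shape and the class balance:

# Confirm the number of rows/columns and the class balance.
print(iris.shape)                      # (150, 5)
print(iris['species'].value_counts())  # 50 observations per species

A pairplot then shows how the features relate to each other across species.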

sns.pairplot(iris, hue='species')

The pairplot shows a scatterplot for each pair of features, with the points colored by species. The setosa species separates cleanly from the other two, most obviously in the petal measurements. Versicolor and virginica are harder to distinguish because their petal length and petal width values partly overlap, as the per-species summary below confirms.

[Figure: pairplot of the sepal and petal measurements, with points colored by species]
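
To put rough numbers on that overlap (an extra check, not part of the original figure), per-species summary statistics for the petal measurements are enough:

# Per-species ranges of the petal measurements; the versicolor and
# virginica intervals overlap, while setosa sits well apart.
print(iris.groupby('species')[['petal_length', 'petal_width']].agg(['min', 'mean', 'max']))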

Applications

The Iris dataset has been used in many applications, from simple classification to more complex machine learning models. Here are some examples:

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple classification algorithm that assigns a test point to the majority class among its K nearest neighbors in the training set. The Iris dataset is a classic test case for KNN, since setosa is easily separated from the other two species. Let's fit KNN with K=5 and measure its accuracy on a held-out test set.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Separate the four feature columns from the species labels and
# hold out 20% of the data for testing.
X = iris.iloc[:, :4]
y = iris.iloc[:, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit a 5-nearest-neighbors classifier and score it on the held-out set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output:

Accuracy: 0.93

KNN reaches an accuracy of about 93% on this particular split (the exact number varies with the random train/test split), which suggests that the classes are largely separable.
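
Because a single 80/20 split of only 150 samples is noisy, a quick cross-validation check (added here as a sketch) gives a more stable estimate:

# 5-fold cross-validation is less dependent on one particular split.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(f'CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}')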

Support Vector Machines

A support vector machine (SVM) is a powerful classification algorithm that finds the maximum-margin hyperplane separating the classes. SVMs can handle data that is not linearly separable by using the kernel trick, which implicitly maps the data into a higher-dimensional space where a linear separation is possible. Let's fit an SVM with a radial basis function (RBF) kernel and measure its accuracy.

from sklearn.svm import SVC

# Fit an SVM with an RBF kernel on the same train/test split as before.
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output:

Accuracy: 1.00

On this split the SVM classifies every test point correctly. A perfect score on a single 20% hold-out set should not be over-interpreted (and with an RBF kernel it says nothing about linear separability), but it does show that the three species are well separated by these four measurements.
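
A perfect score on one small test set is worth a closer look. A confusion matrix (a small addition, not in the original) shows whether any versicolor/virginica confusion remains:

# Rows are true classes, columns are predicted classes; off-diagonal
# entries count misclassified test points.
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred, labels=svm.classes_))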

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that finds the orthogonal directions of maximum variance in the data and projects the data onto those directions. PCA can be used to visualize high-dimensional data in a lower-dimensional space. Let's apply PCA to the Iris dataset and visualize it in 2D.

from sklearn.decomposition import PCA

# Project the four features onto the two directions of maximum variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Put the projection in a DataFrame so seaborn can color the points by species.
iris_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
iris_pca['species'] = y
sns.scatterplot(data=iris_pca, x='PC1', y='PC2', hue='species')

[Figure: 2D PCA projection of the Iris data (PC1 vs. PC2), colored by species]

The setosa species is clearly separated from the other two along the first principal component, which captures most of the variance in the data, while versicolor and virginica partly overlap in the projection.
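
To back up the claim that the first principal component carries most of the structure (an added check), the explained variance ratio can be printed:

# Fraction of the total variance captured by each principal component;
# on the raw (unstandardized) Iris features, PC1 dominates.
print(pca.explained_variance_ratio_)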

Conclusion

The Iris dataset is a classic dataset in machine learning and statistics that has been used in many applications. We have explored the dataset and its structure, and we have implemented some simple applications: KNN, SVM, and PCA. The dataset is easy to classify, especially the setosa species, which is clearly separable from the other two. Versicolor and virginica are harder to tell apart because their petal measurements overlap. Overall, the Iris dataset remains a valuable resource for researchers and practitioners in machine learning and statistics.
