Skip to content
geeksforgeeks
  • Tutorials
    • Python
    • Java
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • Software and Tools
    • School Learning
    • Practice Coding Problems
  • GfG Premium
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Top Datasets for data visualization
Next article icon

Exploratory Data Analysis on Iris Dataset

Last Updated : 02 May, 2025
Summarize
Comments
Improve
Suggest changes
Share
Like Article
Like
Report

Exploratory Data Analysis (EDA) is a technique to analyze data using some visual Techniques. With this technique, we can get detailed information about the statistical summary of the data. We will also be able to deal with the duplicates values, outliers, and also see some trends or patterns present in the dataset.

Now let's see a brief about the Iris dataset.

Iris Dataset

If you are from a data science background you all must be familiar with the Iris Dataset. If you are not then don't worry we will discuss this here.

Iris Dataset is considered as the Hello World for data science. It contains five columns namely - Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant, the researchers have measured various features of the different iris flowers and recorded them digitally.

You can download the Iris.csv file from the link. Now we will use the Pandas library to load this CSV file, and we will convert it into the dataframe. read_csv() method is used to read CSV files.

Example:

Python
import pandas as pd

# Reading the CSV file
df = pd.read_csv("Iris.csv")

# Printing top 5 rows
df.head()

Output:

read csv eda iris

Getting Information about the Dataset

We will use the shape parameter to get the shape of the dataset.

Example:

Python
df.shape

Output:

(150, 6)

We can see that the dataframe contains 6 columns and 150 rows.

Now, let's also the columns and their data types. For this, we will use the info() method.

Example:

Python
df.info()

Output:

iris info EDA

We can see that only one column has categorical data and all the other columns are of the numeric type with non-Null entries.

Let's get a quick statistical summary of the dataset using the describe() method. The describe() function applies basic statistical computations on the dataset like extreme values, count of data points standard deviation, etc. Any missing value or NaN value is automatically skipped. describe() function gives a good picture of the distribution of data.

Example:

Python
df.describe()

Output:

iris describe EDA

We can see the count of each column along with their mean value, standard deviation, minimum and maximum values.

Checking Missing Values

We will check if our data contains any missing values or not. Missing values can occur when no information is provided for one or more items or for a whole unit. We will use the isnull() method.

Example:

Python
df.isnull().sum()

Output:

missing value iris EDA

We can see that no column as any missing value.

Note: For more information, refer Working with Missing Data in Pandas.

Checking Duplicates

Let's see if our dataset contains any duplicates or not. Pandas drop_duplicates() method helps in removing duplicates from the data frame.

Example:

Python
data = df.drop_duplicates(subset ="Species",)
data

Output:

iris duplicates EDA

We can see that there are only three unique species. Let's see if the dataset is balanced or not i.e. all the species contain equal amounts of rows or not. We will use the Series.value_counts() function. This function returns a Series containing counts of unique values. 

Example:

Python
df.value_counts("Species")

Output:

value count iris EDA

We can see that all the species contain an equal amount of rows, so we should not delete any entries.

Data Visualization

Visualizing the target column

Our target column will be the Species column because at the end we will need the result according to the species only. Let's see a countplot for species.

Note: We will use Matplotlib and Seaborn library for the data visualization. If you want to know about these modules refer to the articles - 

  • Matplotlib Tutorial
  • Python Seaborn Tutorial

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt


sns.countplot(x='Species', data=df, )
plt.show()

Output:

countplot iris EDA

Relation between variables

We will see the relationship between the sepal length and sepal width and also between petal length and petal width.

Example 1: Comparing Sepal Length and Sepal Width

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt


sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df, )

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

relation between variables iris EDA

From the above plot, we can infer that - 

  • Species Setosa has smaller sepal lengths but larger sepal widths.
  • Versicolor Species lies in the middle of the other two species in terms of sepal length and width
  • Species Virginica has larger sepal lengths but smaller sepal widths.

Example 2: Comparing Petal Length and Petal Width

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt


sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',
                hue='Species', data=df, )

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

relation between variables peta iris EDA

From the above plot, we can infer that - 

  • Species Setosa has smaller petal lengths and widths.
  • Versicolor Species lies in the middle of the other two species in terms of petal length and width
  • Species Virginica has the largest of petal lengths and widths.

Let's plot all the column's relationships using a pairplot. It can be used for multivariate analysis.

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt


sns.pairplot(df.drop(['Id'], axis = 1), 
             hue='Species', height=2)

Output:

pairplot iris EDA

We can see many types of relationships from this plot such as the species Setosa has the smallest of petals widths and lengths. It also has the smallest sepal length but larger sepal widths. Such information can be gathered about any other species.

Histograms

Histograms allow seeing the distribution of data for various columns. It can be used for uni as well as bi-variate analysis.

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt


fig, axes = plt.subplots(2, 2, figsize=(10,10))

axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)

axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5);

axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6);

axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6);

Output:

histogram iris EDA

From the above plot, we can see that - 

  • The highest frequency of the sepal length is between 30 and 35 which is between 5.5 and 6
  • The highest frequency of the sepal Width is around 70 which is between 3.0 and 3.5
  • The highest frequency of the petal length is around 50 which is between 1 and 2
  • The highest frequency of the petal width is between 40 and 50 which is between 0.0 and 0.5

Histograms with Distplot Plot

Distplot is used basically for the univariant set of observations and visualizes it through a histogram i.e. only one observation and hence we choose one particular column of the dataset.

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "SepalLengthCm").add_legend()

plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "SepalWidthCm").add_legend()

plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalLengthCm").add_legend()

plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalWidthCm").add_legend()

plt.show()

Output:

distplot iris EDAdisplot iris EDA

From the above plots, we can see that - 

  • In the case of Sepal Length, there is a huge amount of overlapping.
  • In the case of Sepal Width also, there is a huge amount of overlapping.
  • In the case of Petal Length, there is a very little amount of overlapping.
  • In the case of Petal Width also, there is a very little amount of overlapping.

So we can use Petal Length and Petal Width as the classification feature.

Handling Correlation

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded. For any non-numeric data type columns in the dataframe it is ignored.

Example:

Python
data.select_dtypes(include=['number']).corr(method='pearson')

# This code is modified by Susobhan Akhuli

Output:

iris corr EDA

Heatmaps

The heatmap is a data visualization technique that is used to analyze the dataset as colors in two dimensions. Basically, it shows a correlation between all numerical variables in the dataset. In simpler terms, we can plot the above-found correlation using the heatmaps.

Example:

Python
# importing packages 
import seaborn as sns 
import matplotlib.pyplot as plt 


sns.heatmap(df.select_dtypes(include=['number']).corr(method='pearson').drop( 
['Id'], axis=1).drop(['Id'], axis=0), 
			annot = True); 

plt.show()

# This code is modified by Susobhan Akhuli

Output:

heatmap iris EDA

From the above graph, we can see that -

  • Petal width and petal length have high correlations. 
  • Petal length and sepal width have good correlations.
  • Petal Width and Sepal length have good correlations.

Box Plots

We can use boxplots to see how the categorical value os distributed with other numerical values.

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

def graph(y):
    sns.boxplot(x="Species", y=y, data=df)

plt.figure(figsize=(10,10))
    
# Adding the subplot at the specified
# grid position
plt.subplot(221)
graph('SepalLengthCm')

plt.subplot(222)
graph('SepalWidthCm')

plt.subplot(223)
graph('PetalLengthCm')

plt.subplot(224)
graph('PetalWidthCm')

plt.show()

Output:

boxplot iris EDA

From the above graph, we can see that - 

  • Species Setosa has the smallest features and less distributed with some outliers.
  • Species Versicolor has the average features.
  • Species Virginica has the highest features

Handling Outliers

An Outlier is a data-item/object that deviates significantly from the rest of the (so-called normal)objects. They can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect the outliers, and the removal process is the data frame same as removing a data item from the panda’s dataframe.

Let's consider the iris dataset and let's plot the boxplot for the SepalWidthCm column.

Example:

Python
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)

Output:

In the above graph, the values above 4 and below 2 are acting as outliers.

Removing Outliers

For removing the outlier, one must follow the same process of removing an entry from the dataset using its exact position in the dataset because in all the above methods of detecting the outliers end result is the list of all those data items that satisfy the outlier definition according to the method used.

Example: We will detect the outliers using IQR and then we will remove them. We will also draw the boxplot to see if the outliers are removed or not.

Python
# Importing
import numpy as np

# Load the dataset 
df = pd.read_csv('Iris.csv') 

# IQR 
Q1 = np.percentile(df['SepalWidthCm'], 25, 
				interpolation = 'midpoint') 

Q3 = np.percentile(df['SepalWidthCm'], 75, 
				interpolation = 'midpoint') 
IQR = Q3 - Q1 

print("Old Shape: ", df.shape) 

# Upper bound 
upper = np.where(df['SepalWidthCm'] >= (Q3+1.5*IQR)) 

# Lower bound 
lower = np.where(df['SepalWidthCm'] <= (Q1-1.5*IQR)) 

# Removing the Outliers 
df.drop(upper[0], inplace = True) 
df.drop(lower[0], inplace = True) 

print("New Shape: ", df.shape) 

sns.boxplot(x='SepalWidthCm', data=df)

# This code is modified by Susobhan Akhuli

Output:

remove outliers EDA

Note: for more information, refer Detect and Remove the Outliers using Python

Get the complete notebook and dataset link here:

Notebook link : click here.

Dataset Link: click here


EDA on IRIS Dataset using Logistic Regression Model
Next Article
Top Datasets for data visualization

A

abhishek1
Improve
Article Tags :
  • Data Analysis
  • AI-ML-DS
  • ML-EDA
  • AI-ML-DS With Python

Similar Reads

    What is Exploratory Data Analysis?
    Exploratory Data Analysis (EDA) is a important step in data science as it visualizing data to understand its main features, find patterns and discover how different parts of the data are connected. In this article, we will see more about Exploratory Data Analysis (EDA).Why Exploratory Data Analysis
    8 min read
    Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn
    Exploratory Data Analysis (EDA) serves as the foundation of any data science project. It is an essential step where data scientists investigate datasets to understand their structure, identify patterns, and uncover insights. Data preparation involves several steps, including cleaning, transforming,
    4 min read
    Top Datasets for data visualization
    Data Visualization is a graphical structure representing the data to share its insight information. Whether you're a data scientist, analyst, or enthusiast, working with high-quality datasets is essential for creating compelling visualizations that tell a story and provide valuable insights. Top Dat
    7 min read
    Data Analysis (Analytics) Tutorial
    Data Analytics is a process of examining, cleaning, transforming and interpreting data to discover useful information, draw conclusions and support decision-making. It helps businesses and organizations understand their data better, identify patterns, solve problems and improve overall performance.
    4 min read
    Seaborn Datasets For Data Science
    Seaborn, a Python data visualization library, offers a range of built-in datasets that are perfect for practicing and demonstrating various data science concepts. These datasets are designed to be simple, intuitive, and easy to work with, making them ideal for beginners and experienced data scientis
    7 min read
    Analyzing Decision Tree and K-means Clustering using Iris dataset
    Iris Dataset is one of best know datasets in pattern recognition literature. This dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2 the latter are NOT linearly separable from each other. Attribute Inform
    5 min read
`; $(commentSectionTemplate).insertBefore(".article--recommended"); } loadComments(); }); }); function loadComments() { if ($("iframe[id*='discuss-iframe']").length top_of_element && top_of_screen articleRecommendedTop && top_of_screen articleRecommendedBottom)) { if (!isfollowingApiCall) { isfollowingApiCall = true; setTimeout(function(){ if (loginData && loginData.isLoggedIn) { if (loginData.userName !== $('#followAuthor').val()) { is_following(); } else { $('.profileCard-profile-picture').css('background-color', '#E7E7E7'); } } else { $('.follow-btn').removeClass('hideIt'); } }, 3000); } } }); } $(".accordion-header").click(function() { var arrowIcon = $(this).find('.bottom-arrow-icon'); arrowIcon.toggleClass('rotate180'); }); }); window.isReportArticle = false; function report_article(){ if (!loginData || !loginData.isLoggedIn) { const loginModalButton = $('.login-modal-btn') if (loginModalButton.length) { loginModalButton.click(); } return; } if(!window.isReportArticle){ //to add loader $('.report-loader').addClass('spinner'); jQuery('#report_modal_content').load(gfgSiteUrl+'wp-content/themes/iconic-one/report-modal.php', { PRACTICE_API_URL: practiceAPIURL, PRACTICE_URL:practiceURL },function(responseTxt, statusTxt, xhr){ if(statusTxt == "error"){ alert("Error: " + xhr.status + ": " + xhr.statusText); } }); }else{ window.scrollTo({ top: 0, behavior: 'smooth' }); $("#report_modal_content").show(); } } function closeShareModal() { const shareOption = document.querySelector('[data-gfg-action="share-article"]'); shareOption.classList.remove("hover_share_menu"); let shareModal = document.querySelector(".hover__share-modal-container"); shareModal && shareModal.remove(); } function openShareModal() { closeShareModal(); // Remove existing modal if any let shareModal = document.querySelector(".three_dot_dropdown_share"); shareModal.appendChild(Object.assign(document.createElement("div"), { className: "hover__share-modal-container" })); document.querySelector(".hover__share-modal-container").append( Object.assign(document.createElement('div'), { className: "share__modal" }), ); document.querySelector(".share__modal").append(Object.assign(document.createElement('h1'), { className: "share__modal-heading" }, { textContent: "Share to" })); const socialOptions = ["LinkedIn", "WhatsApp","Twitter", "Copy Link"]; socialOptions.forEach((socialOption) => { const socialContainer = Object.assign(document.createElement('div'), { className: "social__container" }); const icon = Object.assign(document.createElement("div"), { className: `share__icon share__${socialOption.split(" ").join("")}-icon` }); const socialText = Object.assign(document.createElement("span"), { className: "share__option-text" }, { textContent: `${socialOption}` }); const shareLink = (socialOption === "Copy Link") ? Object.assign(document.createElement('div'), { role: "button", className: "link-container CopyLink" }) : Object.assign(document.createElement('a'), { className: "link-container" }); if (socialOption === "LinkedIn") { shareLink.setAttribute('href', `https://www.linkedin.com/sharing/share-offsite/?url=${window.location.href}`); shareLink.setAttribute('target', '_blank'); } if (socialOption === "WhatsApp") { shareLink.setAttribute('href', `https://api.whatsapp.com/send?text=${window.location.href}`); shareLink.setAttribute('target', "_blank"); } if (socialOption === "Twitter") { shareLink.setAttribute('href', `https://twitter.com/intent/tweet?url=${window.location.href}`); shareLink.setAttribute('target', "_blank"); } shareLink.append(icon, socialText); socialContainer.append(shareLink); document.querySelector(".share__modal").appendChild(socialContainer); //adding copy url functionality if(socialOption === "Copy Link") { shareLink.addEventListener("click", function() { var tempInput = document.createElement("input"); tempInput.value = window.location.href; document.body.appendChild(tempInput); tempInput.select(); tempInput.setSelectionRange(0, 99999); // For mobile devices document.execCommand('copy'); document.body.removeChild(tempInput); this.querySelector(".share__option-text").textContent = "Copied" }) } }); // document.querySelector(".hover__share-modal-container").addEventListener("mouseover", () => document.querySelector('[data-gfg-action="share-article"]').classList.add("hover_share_menu")); } function toggleLikeElementVisibility(selector, show) { document.querySelector(`.${selector}`).style.display = show ? "block" : "none"; } function closeKebabMenu(){ document.getElementById("myDropdown").classList.toggle("show"); }
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences