Top 10 Pandas Functions Every Beginner Should Know

Python | Dec-19-2024 | 5 min read

Pandas is the go-to library for data analysts and Python programmers venturing into the world of data manipulation and analysis. Its intuitive syntax and powerful data structures make handling vast datasets not only manageable but also efficient. Whether you're importing CSV files, cleaning messy datasets, or analyzing data trends, Pandas has the tools you need.

If you’re just starting your Pandas learning journey, this post will introduce you to 10 essential functions and methods that every beginner should master. These functions will help you load, inspect, clean, and analyze data with ease.
Here is a quick overview of each method and what it does:

  • read_csv(): Loads data from a CSV file into a Pandas DataFrame.
  • head(): Displays the first few rows of the DataFrame.
  • info(): Provides a concise summary of the DataFrame, including column types and missing values.
  • describe(): Generates descriptive statistics for numeric columns.
  • isnull(): Identifies missing data in the DataFrame.
  • dropna(): Removes rows or columns with missing data.
  • fillna(): Replaces missing data with specified values.
  • groupby(): Groups data based on one or more columns and applies aggregate functions.
  • sort_values(): Sorts the DataFrame by one or more columns.
  • apply(): Applies custom functions to rows or columns of the DataFrame.

Prerequisites

First, ensure that Python is installed on your machine. If it’s not, you can download it from the official Python website.
Once Python is set up, confirm that the Pandas library is installed. If it isn’t, you can easily add it using the following command:

pip install pandas
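
To confirm the installation worked, you can print the installed version as a quick sanity check:

import pandas as pd

# Print the installed Pandas version to confirm the setup works
print(pd.__version__)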

We will use the following real estate dummy dataset for our examples:

PropertyID  Location       Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
101         New York       3         2          750000   1800           2023-05-15
102         San Francisco  2         1          850000   1400           2023-06-01
103         Los Angeles    4         3                   2500           2023-04-10
104         Chicago        3         2          650000   1700
105         Miami          5         4          1200000  3500           2023-03-22
106         Houston        4         3          450000   2000           2023-07-01

Top 10 Pandas Functions and Methods

1. read_csv(): Loads data from a CSV file into a Pandas DataFrame.

The first step in using pandas is loading your data to prepare it for processing or analysis. One of the most common methods for importing data into a pandas DataFrame is the read_csv() function.

Code Example:

import pandas as pd
# Load the dataset into a Pandas DataFrame
file_path = "real_estate_listings.csv"  # Replace with your actual file path

# Create the DataFrame
df = pd.read_csv(file_path)

First, we begin by importing the pandas library. Importing it as pd is a common convention among developers that keeps code concise: instead of writing pandas.read_csv(), we can simply write pd.read_csv(), making the code shorter and easier to read.

Next, specify the path to the CSV file you want to load into a Pandas DataFrame. Then, create a variable and use the read_csv method to load the file into the DataFrame.

Key Parameters:

Now let us explore some important parameters of this method:

  • delimiter: Specifies the delimiter used in the CSV file (e.g., a comma, semicolon, or tab).
df = pd.read_csv(file_path, delimiter=',')
  • index_col: In our table, each property listing is identified by an ID in the "PropertyID" column. To make pandas use this column as the DataFrame's index, we can set it explicitly. Otherwise, pandas generates and assigns its own integer index to each listing.
df = pd.read_csv(file_path, index_col='PropertyID')
  • usecols: If your CSV file contains numerous columns but you only need a select few, usecols lets you load just the columns you name.
df = pd.read_csv(file_path, usecols=['Location', 'Price', 'Bedrooms'])
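
These parameters can also be combined. Here is a minimal sketch (assuming the same real_estate_listings.csv file) that sets the index and loads only a few columns; note that a column passed to index_col must also be listed in usecols:

import pandas as pd

# Use PropertyID as the index and load only the columns we need
df = pd.read_csv(
    "real_estate_listings.csv",
    index_col="PropertyID",
    usecols=["PropertyID", "Location", "Price", "Bedrooms"],
)
print(df.head())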

2. head() and tail(): Inspect Data

Now that the data is loaded into the DataFrame, we can begin exploring it. To do so, Pandas provides two very handy methods: head() and tail(). These methods are essential for quickly inspecting the first or last few rows of your dataset.

  • head(): This method shows the first 5 rows of the dataset by default. It’s particularly useful to verify that the data has been loaded correctly and to check its structure.
  • tail(): This method works like head() but displays the last few rows instead. It’s helpful for examining the end of the dataset, for example to check whether the data is complete or whether trailing records have missing values.

Code Example:

# View the first 5 rows
print(df.head())
# View the last 3 rows
print(df.tail(3))

Output of head():

PropertyID  Location       Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
101         New York       3         2          750000   1800           2023-05-15
102         San Francisco  2         1          850000   1400           2023-06-01
103         Los Angeles    4         3                   2500           2023-04-10
104         Chicago        3         2          650000   1700
105         Miami          5         4          1200000  3500           2023-03-22

Output of tail():

PropertyID  Location  Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
104         Chicago   3         2          650000   1700
105         Miami     5         4          1200000  3500           2023-03-22
106         Houston   4         3          450000   2000           2023-07-01

3. info() - Understand DataFrame Structure

The info() function provides a concise summary of the DataFrame. It includes details such as:

  • The number of rows and columns.
  • Column names and their data types.
  • The count of non-null values in each column.

This method is useful for quickly understanding your dataset's structure and spotting issues like missing values or incorrect data types.

Code Example:

# Get a concise summary of the DataFrame
print(df.info())

The code above produces the following output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
#   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
0   PropertyID      6 non-null      int64 
1   Location        6 non-null      object
2   Bedrooms        6 non-null      int64 
3   Bathrooms       6 non-null      int64 
4   Price           5 non-null      float64
5   SquareFootage   6 non-null      int64 
6   ListingDate     5 non-null      object
dtypes: float64(1), int64(4), object(2)
memory usage: 464.0+ bytes

4. describe(): Generate Summary Statistics

The describe() method gives a quick summary of key statistical metrics like mean, standard deviation, percentiles, and more. By default, describe() works with numeric data but can also handle categorical data, offering tailored insights based on data type.

Key Parameters:

  • percentiles: A list of numbers between 0 and 1, specifying which percentiles to return. The default is None, which returns the 25th, 50th, and 75th percentiles.
  • include: A list of data types to include in the summary. You can specify data types such as int, float, object (for strings), etc. The default is None, meaning all numeric types are included.
  • exclude: A list of data types to exclude from the summary. This parameter is also None by default, meaning no types are excluded.

A short sketch using these parameters appears at the end of this section.

Code Example:

# Summarize all columns, including non-numeric ones
print(df.describe(include='all'))

The output of the code is as follows:

       PropertyID  Location  Bedrooms  ...         Price  SquareFootage  ListingDate
count     6.000000         6  6.000000  ...  5.000000e+00       6.000000            5
unique         NaN         6       NaN  ...           NaN            NaN            5
top            NaN  New York       NaN  ...           NaN            NaN   2023-05-15
freq           NaN         1       NaN  ...           NaN            NaN            1
mean    103.500000       NaN  3.500000  ...  7.800000e+05    2150.000000          NaN
std       1.870829       NaN  1.048809  ...  2.774887e+05     755.645419          NaN
min     101.000000       NaN  2.000000  ...  4.500000e+05    1400.000000          NaN
25%     102.250000       NaN  3.000000  ...  6.500000e+05    1725.000000          NaN
50%     103.500000       NaN  3.500000  ...  7.500000e+05    1900.000000          NaN
75%     104.750000       NaN  4.000000  ...  8.500000e+05    2375.000000          NaN
max     106.000000       NaN  5.000000  ...  1.200000e+06    3500.000000          NaN

Explanation of Each Statistic

  • Count: The number of non-missing entries for each column.
    • Example: count = 6 for Bedrooms, indicating all rows have valid data.
  • Mean: The average value of the column.
    • Example: mean = 3.5 for Bedrooms, indicating an average of 3.5 bedrooms per property.
  • Std: The standard deviation, showing the spread of values from the mean.
    • Example: std = 1.048809 for Bedrooms, meaning there is moderate variation in the number of bedrooms.
  • Min/Max: The smallest and largest values.
    • Example: min = 2 and max = 5 for Bedrooms.
  • 25%, 50%, 75% (Quartiles): These represent the data's spread:
    • 25%: The value below which 25% of the data falls (first quartile).
    • 50%: The median value.
    • 75%: The value below which 75% of the data falls (third quartile).

5. isnull(): Identify Missing Data

The isnull() function helps identify missing data in a DataFrame. It returns a DataFrame of the same shape as the original, with True where data is missing (NaN) and False otherwise. It’s commonly used with other functions like sum() to count missing values across columns or rows.

Code Example:

# Identify missing values
print(df.isnull())
# Count missing values for each column
print(df.isnull().sum())

Output of df.isnull():

  PropertyID  Location  Bedrooms  Bathrooms  Price  SquareFootage  ListingDate
0       False     False     False      False  False          False        False
1       False     False     False      False  False          False        False
2       False     False     False      False   True          False        False
3       False     False     False      False  False          False         True
4       False     False     False      False  False          False        False
5       False     False     False      False  False          False        False

Output of df.isnull().sum():

PropertyID       0
Location         0
Bedrooms         0
Bathrooms        0
Price            1
SquareFootage    0
ListingDate      1
dtype: int64
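
Because isnull() returns a Boolean DataFrame, you can aggregate along rows as well as columns. A short sketch:

# Count missing values per row instead of per column
print(df.isnull().sum(axis=1))

# Show only the rows that contain at least one missing value
print(df[df.isnull().any(axis=1)])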

6. dropna(): Remove Missing Data

The dropna() function removes rows or columns with missing data (NaN) from the DataFrame. By default, it removes rows with any missing values. You can customize its behavior to remove columns instead, or to drop rows/columns only where all values, or a specific subset of values, are missing (see the sketch at the end of this section).

Code Example:

# Drop rows with any missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)

# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)

This is what the DataFrame looks like after removing all rows containing NaN values:

PropertyID  Location       Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
101         New York       3         2          750000   1800           2023-05-15
102         San Francisco  2         1          850000   1400           2023-06-01
105         Miami          5         4          1200000  3500           2023-03-22
106         Houston        4         3          450000   2000           2023-07-01

Here is the DataFrame with columns containing NaN values removed:

PropertyID  Location       Bedrooms  Bathrooms  SquareFootage
101         New York       3         2          1800
102         San Francisco  2         1          1400
103         Los Angeles    4         3          2500
104         Chicago        3         2          1700
105         Miami          5         4          3500
106         Houston        4         3          2000
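
As mentioned above, dropna() takes parameters for finer control. A minimal sketch of the two most common ones:

# Drop rows only when every value in the row is missing
df_all_missing = df.dropna(how='all')

# Drop rows only when specific columns contain missing values
df_price_present = df.dropna(subset=['Price'])
print(df_price_present)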

7. fillna(): Replace Missing Data

The fillna() function replaces missing data (NaN) with specified values. While dropna() removes null values from a DataFrame, fillna() offers a flexible alternative: you can fill gaps with a fixed value, or with the mean, median, mode, or another calculated statistic (a sketch using the median and mode appears at the end of this section).

Code Example:

# Fill missing values in the 'Price' column with the mean price
df['Price'] = df['Price'].fillna(df['Price'].mean())
print(df)

The output of the code demonstrates that the NaN value has been successfully replaced with the average price.

PropertyID  Location       Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
101         New York       3         2          750000   1800           2023-05-15
102         San Francisco  2         1          850000   1400           2023-06-01
103         Los Angeles    4         3          780000   2500           2023-04-10
104         Chicago        3         2          650000   1700           NaN
105         Miami          5         4          1200000  3500           2023-03-22
106         Houston        4         3          450000   2000           2023-07-01

The ListingDate column also contains a NaN value. Instead of performing a calculation, we can simply use a placeholder such as "Unknown". Here is how to do that:

# Fill missing values in the 'ListingDate' column with a placeholder
df['ListingDate'] = df['ListingDate'].fillna('Unknown')
print(df)

The DataFrame now looks as follows:

PropertyID  Location       Bedrooms  Bathrooms  Price    SquareFootage  ListingDate
101         New York       3         2          750000   1800           2023-05-15
102         San Francisco  2         1          850000   1400           2023-06-01
103         Los Angeles    4         3          780000   2500           2023-04-10
104         Chicago        3         2          650000   1700           Unknown
105         Miami          5         4          1200000  3500           2023-03-22
106         Houston        4         3          450000   2000           2023-07-01
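
As noted above, statistics other than the mean can be used the same way. A minimal sketch (the Location line is purely illustrative, since that column has no missing values in our dataset):

# Fill with the median price instead of the mean
df['Price'] = df['Price'].fillna(df['Price'].median())

# Fill a categorical column with its most frequent value (mode)
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])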

8. groupby(): Group and Aggregate Data

The groupby() function in Pandas is a versatile and powerful tool for dividing a DataFrame into groups based on one or more columns, enabling efficient data analysis and aggregation. It operates on a split-apply-combine principle: first, the data is split into groups; next, a specified function is applied to each group; finally, the results are combined into a new DataFrame. This approach simplifies complex data manipulation tasks (a sketch of multi-function aggregation with agg() appears at the end of this section).

Code Example:

For example, you can calculate the average price of properties by location or count the number of listings per bedroom category.

# Group by 'Location' and calculate the average price
avg_price_by_location = df.groupby('Location')['Price'].mean()
print(avg_price_by_location)
# Group by 'Bedrooms' and calculate the total square footage
total_sqft_by_bedrooms = df.groupby('Bedrooms')['SquareFootage'].sum()
print(total_sqft_by_bedrooms)
# Group by multiple columns and calculate the count
count_by_location_bedrooms = df.groupby(['Location', 'Bedrooms']).size()
print(count_by_location_bedrooms)

Average Price by Location:

Location
Chicago           650000.0
Houston           450000.0
Los Angeles            NaN
Miami            1200000.0
New York          750000.0
San Francisco     850000.0
Name: Price, dtype: float64

Total Square Footage by Bedrooms:

Bedrooms
2    1400
3    3500
4    4500
5    3500
Name: SquareFootage, dtype: int64

Count by Location and Bedrooms:

Location       Bedrooms
Chicago        3           1
Houston        4           1
Los Angeles    4           1
Miami          5           1
New York       3           1
San Francisco  2           1
dtype: int64
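
As promised, groupby() also supports applying several aggregate functions at once via agg(). A short sketch:

# Apply several aggregations to one column in a single call
price_stats = df.groupby('Location')['Price'].agg(['mean', 'min', 'max'])
print(price_stats)

# Named aggregations across multiple columns
summary = df.groupby('Bedrooms').agg(
    avg_price=('Price', 'mean'),
    total_sqft=('SquareFootage', 'sum'),
)
print(summary)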

9. sort_values(): Sort Data

The sort_values() function is used to sort your DataFrame by one or more columns, in either ascending or descending order. Sorting is essential for ranking data, identifying extremes (e.g., the highest-priced properties), or simply organizing the dataset for better readability. You can sort based on numerical, categorical, or datetime columns.

Code Example:

# Sort by 'Price' in descending order
df_sorted_by_price = df.sort_values(by='Price', ascending=False)
print(df_sorted_by_price)
# Sort by 'Location' in ascending order
df_sorted_by_location = df.sort_values(by='Location')
print(df_sorted_by_location)
# Sort by multiple columns: first by 'Bedrooms' (ascending) and then by 'Price' (descending)
df_sorted_by_bedrooms_price = df.sort_values(by=['Bedrooms', 'Price'], ascending=[True, False])
print(df_sorted_by_bedrooms_price)

Sort by Price (Descending):

  PropertyID       Location  Bedrooms  ...      Price  SquareFootage  ListingDate
4         105          Miami         5  ...  1200000.0           3500   2023-03-22
1         102  San Francisco         2  ...   850000.0           1400   2023-06-01
0         101       New York         3  ...   750000.0           1800   2023-05-15
3         104        Chicago         3  ...   650000.0           1700          NaN
5         106        Houston         4  ...   450000.0           2000   2023-07-01
2         103    Los Angeles         4  ...        NaN           2500   2023-04-10

Sort by Location (Ascending):

  PropertyID       Location  Bedrooms  ...      Price  SquareFootage  ListingDate
3         104        Chicago         3  ...   650000.0           1700          NaN
5         106        Houston         4  ...   450000.0           2000   2023-07-01
2         103    Los Angeles         4  ...        NaN           2500   2023-04-10
4         105          Miami         5  ...  1200000.0           3500   2023-03-22
0         101       New York         3  ...   750000.0           1800   2023-05-15
1         102  San Francisco         2  ...   850000.0           1400   2023-06-01

Sort by Bedrooms (Ascending) and Price (Descending):

  PropertyID       Location  Bedrooms  ...      Price  SquareFootage  ListingDate
1         102  San Francisco         2  ...   850000.0           1400   2023-06-01
0         101       New York         3  ...   750000.0           1800   2023-05-15
3         104        Chicago         3  ...   650000.0           1700          NaN
5         106        Houston         4  ...   450000.0           2000   2023-07-01
2         103    Los Angeles         4  ...        NaN           2500   2023-04-10
4         105          Miami         5  ...  1200000.0           3500   2023-03-22
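
Notice in the outputs above that rows with a missing Price land at the bottom: pandas places NaN values last by default, regardless of sort direction. The na_position parameter controls this. A sketch:

# Put rows with a missing 'Price' first instead of last
df_nan_first = df.sort_values(by='Price', ascending=False, na_position='first')
print(df_nan_first)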

10. apply(): Apply Custom Functions to Data

The apply() function allows you to apply custom functions to rows or columns in a DataFrame. It’s a powerful method for performing element-wise transformations, calculations, or conditional operations that go beyond built-in Pandas functions.

Code Example:

Let's look at an example where we create a new column that calculates the price per square foot.

# Define a custom function to calculate price per square foot
def price_per_sqft(price, sqft):
    return price / sqft if sqft != 0 else 0

# Apply the custom function to create a new column
df['PricePerSqFt'] = df.apply(lambda row: price_per_sqft(row['Price'], row['SquareFootage']), axis=1)
print(df)

Here is the output of the above code:

  PropertyID       Location  Bedrooms  ...  SquareFootage  ListingDate  PricePerSqFt
0         101       New York         3  ...           1800   2023-05-15    416.666667
1         102  San Francisco         2  ...           1400   2023-06-01    607.142857
2         103    Los Angeles         4  ...           2500   2023-04-10           NaN
3         104        Chicago         3  ...           1700          NaN    382.352941
4         105          Miami         5  ...           3500   2023-03-22    342.857143
5         106        Houston         4  ...           2000   2023-07-01    225.000000
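
For simple arithmetic like this, a vectorized expression is usually faster than apply() and is generally preferred; note, though, that unlike our custom function it would yield inf rather than 0 if SquareFootage were ever 0. A sketch of the equivalent one-liner:

# Vectorized equivalent: element-wise division of two columns
df['PricePerSqFt'] = df['Price'] / df['SquareFootage']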

Another example is converting location names to uppercase:

# Apply a transformation to uppercase all location names
df['Location'] = df['Location'].apply(lambda x: x.upper())
print(df)

Here is how that looks:

   PropertyID       Location  Bedrooms  ...      Price  SquareFootage  ListingDate
0         101       NEW YORK         3  ...   750000.0           1800   2023-05-15
1         102  SAN FRANCISCO         2  ...   850000.0           1400   2023-06-01
2         103    LOS ANGELES         4  ...        NaN           2500   2023-04-10
3         104        CHICAGO         3  ...   650000.0           1700          NaN
4         105          MIAMI         5  ...  1200000.0           3500   2023-03-22
5         106        HOUSTON         4  ...   450000.0           2000   2023-07-01

Conclusion

In this guide, we explored 10 essential Pandas functions for data analysis, from loading and inspecting data with read_csv() and info() to cleaning and transforming it using methods like isnull(), fillna(), and apply(). These functions form the foundation of any data analysis workflow, enabling you to clean, manipulate, and summarize datasets effectively. Mastering them will prepare you to handle real-world data challenges with confidence. Start practicing today, and explore how these tools can simplify your analysis tasks!