|
|
Pandas (stands for Python Data Analysis) is an open-source software library that is intended for data manipulation and analysis.It evolves around two primary Data structures: Series (1D) and DataFrame (2D).Pandas is built on top of NumPy, and efficiently manages large datasets an also offer ways for data cleaning, transformation, and analysis.Pandas also provide tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or specify parse_dates=True during CSV loading. Pandas integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn.
What is Pandas mainly used for?
With pandas, you can perform a wide range of data operations, such as follows
• Pandas are very effective in Data Analysis for various Machine Learning algorithms. It can be used to read and write data from various file formats like CSV, Excel and SQL databases.
• Machine Learning algorithms (both supervised and unsupervised) need clean data. Pandas can clean and prepare data by handling missing values and filtering entries.
• Sometimes data can be obtained from multiple sources, Pandas can help in merging and joining the multiple datasets.
• Pandas can reshape the data through pivoting and stacking operations.
• Pandas can also be used for statistical analysis and generating descriptive statistics.
• Pandas can help in visualizing data with plotting capabilities.
You can create a pandas DataFrame using various data structures from a Python dictionary, a list of lists, or a list of dictionaries and passing the data into DataFrame() methof of Pandas.
The Pandas library should be imported.
import pandas as pd
Here are some common methods for creating a DataFrame:
Code Example 1:
#Create an empty Dataframe
import pandas as pd
df=pd.DataFrame()
#print
print(df)
#Output:
Empty DataFrame
Columns: []
Index: []
Code Example 2:
#Create DataFrame from List
row1=[1,'John', 3.65]
row2=[2,'Mary', 3.76]
row3=[3,'Mike', 3.60]
row4=[4,'Amy', 3.75]
data=[row1,row2,row3,row4]
column_names=['Id','Name','GPA']
df=pd.DataFrame(data, columns=column_names)
#print
print(df)
#Output:
Id Name GPA
0 1 John 3.65
1 2 Mary 3.76
2 3 Mike 3.60
3 4 Amy 3.75
Code Example 3:
#Create DataFrame from Dictionary
dict={'name':["Vinny","Tracy","Mike","John"],
'age': [26, 25, 27, 29],
'weight':[160, 145, 148, 151]
}
df = pd.DataFrame(dict)
#print
print(df)
#Output:
name age weight
0 Vinny 26 160
1 Tracy 25 145
2 Mike 27 148
3 John 29 151
It is a most common way of creating a DataFrame. Use the read_excel() method and pass the path of the excel file as the parameter. If you have data in multiple excel sheets and you want data to create a DataFrame only from a particular excel sheet name, pass the name of the sheet as the second parameter. Let's see some examples.
Code Example 4: #Create Dataframe from Excel file_path = 'my_data.xlsx' # Create the DataFrame df = pd.read_excel(file_path) # View the first 5 rows of the DataFrame print(df.head()) Code Example 5: #Read a specific sheet by name file_path = 'my_data.xlsx' # Create the DataFrame df = pd.read_excel(file_path,sheet_name='Sheet1') # View the first 5 rows of the DataFrame print(df.head())
In Pandas workflows, LabelEncoder is a tool that is used to convert categorical text data into numerical labels (integers). This is a critical preprocessing step because string values cannot be processed by most of the machine learning algorithms and hence require numerical input for training and prediction. We can achieve this using LabelEncoder from sklearn. Let's see an example below.
Code Example 6:
#Dataframe with categotical values for gender
import pandas as pd
from sklearn.preprocessing import LabelEncoder
dict={'name':["Vinny","Tracy","Mike","John"],
'gender':['M', 'F', 'M', 'M'],
'weight':[160, 145, 148, 151]
}
df = pd.DataFrame(dict)
le = LabelEncoder()
df['gender']=le.fit_transform(df['gender'])
print(df)
#Output:
name gender weight
0 Vinny 1 160
1 Tracy 0 145
2 Mike 1 148
3 John 1 151
To drop a column in DataFrame can be done by the following:
df = df.drop(columns=['col1', 'col2'])
By default drop() returns a new DataFrame. To modify the original object directly, use inplace=True.
df = df.drop(columns=['col1', 'col2'], inplace=True)
Code Example 7:
import pandas as pd
dict={'name':["Vinny","Tracy","Mike","John"],
'gender':['M', 'F', 'M', 'M'],
'weight':[160, 115, 148, 151],
'age':[36, 38, 43, 35]
}
df = pd.DataFrame(dict)
updated_df = df.drop(columns=['gender', 'weight'])
print(updated_df)
#Output:
name age
0 Vinny 36
1 Tracy 38
2 Mike 43
3 John 35
#use inplace=True to modify the original
DataFrame directly:
df = df.drop(columns=['gender', 'weight'], inplace=True)
print(df)
#Output:
name age
0 Vinny 36
1 Tracy 38
2 Mike 43
3 John 35
There are 2 ways to convert Date to a numeric format.
1.) Convert to Unix Timestamp (Seconds or Nanoseconds) and then divide by
1000000000 (10 to the power of 9)
2.) 2. Convert to YYYYMMDD Integer
Let's look at both examples.
Code Example 8:
#Convert to Unix Timestamp (Nanoseconds),
Then divide by 1000000000.
import pandas as pd
dict={"dat": ["2024 03 01 19:23:15.887",
"2024 03 02 11:23:15.887"]}
df=pd.DataFrame(dict)
df['numeric_date'] = pd.to_datetime(df['dat']).astype(int) // 1000000000
print(df)
#Output:
dat numeric_date
0 2024 03 01 19:23:15.887 1709320995
1 2024 03 02 11:23:15.887 1709378595
Code Example 9:
Convert to YYYYMMDD Integer
dict={"dat": ["2024 03 01 19:23:15.887",
"2024 03 02 11:23:15.887"]}
df=pd.DataFrame(dict)
df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int)
print(df)
#Output:
dat date_int
0 2024 03 01 19:23:15.887 20240301
1 2024 03 02 11:23:15.887 20240302
Code Example 10:
Convert to YYYYMMDD Integer
import pandas as pd
dict={"dat": ["12/28/1974",
"8/8/1980"]}
df=pd.DataFrame(dict)
df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int)
print(df)
#Output:
dat date_int
0 12/28/1974 19741228
1 8/8/1980 19800808
Sometimes you may need to append the columns from different DataFrames and create a new DataFrame. Let's see an example.
Code Example 11:
import pandas as pd
# Create a sample DataFrame
data = {'c1': ['A', 'B', 'A', 'C', 'B'],
'c2': ['X', 'Y', 'X', 'Z', 'Y'],
'o1': [10, 20, 30, 40, 50],
'o2': [1.1, 2.2, 3.3, 4.4, 5.5],
'x1':[4,6,7,8,2],
'y1':[7,8,3,9,1]
}
df1=pd.DataFrame(data)
df2=df1[['x1','y1']]
df3=df1['o1']
df_final = pd.concat([df2, df3], axis=1)
print(df_final)
#Output:
x1 y1 o1
0 4 7 10
1 6 8 20
2 7 3 30
3 8 9 40
4 2 1 50
Sometimes you may need to eliminate the negative values DataFrames, or convert the negative values to 0. Let's see some examples.
Code Example 12:
import pandas as pd
# Create a sample DataFrame
data = {'c1': [1, 5, -4, 5, 7],
'c2': [4, 5, 7, 9, 1],
'c3': [10, -3, 5, 4, 7],
'c4': [1, 2, 9, 4, 5],
}
df=pd.DataFrame(data)
#Filter for only rows where a specific column's
#value is greater than or equal to zero.
df = df[df['c1'] >= 0]
df = df[df['c3'] >= 0]
print(df)
#Output:
c1 c2 c3 c4
0 1 4 10 1
3 5 9 4 4
4 7 1 7 5
Code Example 13:
#Replace negative values with 0
import pandas as pd
# Create a sample DataFrame
data = {'c1': [1, 5, -4, 5, 7],
'c2': [4, 5, 7, 9, 1],
'c3': [10, -3, 5, 4, 7],
'c4': [1, 2, 9, 4, 5],
}
df=pd.DataFrame(data)
df['c1'] = df['c1'].clip(lower=0)
df['c3'] = df['c3'].clip(lower=0)
print(df)
#Output:
c1 c2 c3 c4
0 1 4 10 1
1 5 5 0 2
2 0 7 5 9
3 5 9 4 4
4 7 1 7 5
Code Example 14:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {'c1': [1, 5, " ", 5, 7],
'c2': [4, 5, 7, 9, 1],
'c3': [10, -3, 5, 4, 7],
'c4': [1, 23, '', 4, 5.],
}
df=pd.DataFrame(data)
df=df.replace(r'^\s*$', np.nan, regex=True).dropna()
#If you want Daframe to have only integers, then use astype(int)
df = df.astype(int)
print(df)
#Output:
c1 c2 c3 c4
0 1 4 10 1
1 5 5 -3 23
3 5 9 4 4
4 7 1 7 5
Code Example 15:
import pandas as pd
df = pd.read_excel('file_data.xlsx')
# Drop rows where at least one element is missing
df_cleaned = df.dropna()
# Drop rows only if all columns are missing
df_cleaned_all = df.dropna(how='all')
df_cleaned.to_excel('cleaned_file.xlsx', index=False)
Code Example 16:
import pandas as pd
# Creating a small sample DataFrame
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
'Score': [81, 91, 78, 92, 88, 75, 95]}
df = pd.DataFrame(data)
# 1. Using head() to see the first 5 rows (default)
print("--- First 5 Rows (head) ---")
print(df.head())
#Output:
Name Score
0 A 81
1 B 91
2 C 78
3 D 92
4 E 88
# 2. Using head(n) to see a custom number of top rows
print("\n--- First 3 Rows (head(3)) ---")
print(df.head(3))
#Output:
Name Score
0 A 81
1 B 91
2 C 78
# 3. Using tail() to see the last 5 rows (default)
print("\n--- Last 5 Rows (tail) ---")
print(df.tail())
#Output:
Name Score
2 C 78
3 D 92
4 E 88
5 F 75
6 G 95
# 4. Using tail(n) to see a custom number of bottom rows
print("\n--- Last 2 Rows (tail(2)) ---")
print(df.tail(2))
#Output:
Name Score
5 F 75
6 G 95
Code Example 15:
import pandas as pd
df = pd.read_excel('file_data.xlsx')
# Drop rows where at least one element is missing
df_cleaned = df.dropna()
# Drop rows only if all columns are missing
df_cleaned_all = df.dropna(how='all')
df_cleaned.to_excel('cleaned_file.xlsx', index=False)