PythonPlaza.com

Pandas

Pandas (stands for Python Data Analysis) is an open-source software library that is intended for data manipulation and analysis.It evolves around two primary Data structures: Series (1D) and DataFrame (2D).Pandas is built on top of NumPy, and efficiently manages large datasets an also offer ways for data cleaning, transformation, and analysis.Pandas also provide tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or specify parse_dates=True during CSV loading. Pandas integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn.

What is Pandas mainly used for?

With pandas, you can perform a wide range of data operations, such as follows
• Pandas are very effective in Data Analysis for various Machine Learning algorithms. It can be used to read and write data from various file formats like CSV, Excel and SQL databases.

• Machine Learning algorithms (both supervised and unsupervised) need clean data. Pandas can clean and prepare data by handling missing values and filtering entries.

• Sometimes data can be obtained from multiple sources, Pandas can help in merging and joining the multiple datasets.

• Pandas can reshape the data through pivoting and stacking operations.

• Pandas can also be used for statistical analysis and generating descriptive statistics.

• Pandas can help in visualizing data with plotting capabilities.

Create DataFrame

You can create a pandas DataFrame using various data structures from a Python dictionary, a list of lists, or a list of dictionaries and passing the data into DataFrame() methof of Pandas. The Pandas library should be imported.

import pandas as pd

Here are some common methods for creating a DataFrame:

Code Example 1: #Create an empty Dataframe import pandas as pd df=pd.DataFrame() #print print(df) #Output: Empty DataFrame Columns: [] Index: [] Code Example 2: #Create DataFrame from List row1=[1,'John', 3.65] row2=[2,'Mary', 3.76] row3=[3,'Mike', 3.60] row4=[4,'Amy', 3.75] data=[row1,row2,row3,row4] column_names=['Id','Name','GPA'] df=pd.DataFrame(data, columns=column_names) #print print(df) #Output: Id Name GPA 0 1 John 3.65 1 2 Mary 3.76 2 3 Mike 3.60 3 4 Amy 3.75 Code Example 3: #Create DataFrame from Dictionary dict={'name':["Vinny","Tracy","Mike","John"], 'age': [26, 25, 27, 29], 'weight':[160, 145, 148, 151] } df = pd.DataFrame(dict) #print print(df) #Output: name age weight 0 Vinny 26 160 1 Tracy 25 145 2 Mike 27 148 3 John 29 151

Create Dataframe from Excel sheet

It is a most common way of creating a DataFrame. Use the read_excel() method and pass the path of the excel file as the parameter. If you have data in multiple excel sheets and you want data to create a DataFrame only from a particular excel sheet name, pass the name of the sheet as the second parameter. Let's see some examples.

Code Example 4: #Create Dataframe from Excel file_path = 'my_data.xlsx' # Create the DataFrame df = pd.read_excel(file_path) # View the first 5 rows of the DataFrame print(df.head()) Code Example 5: #Read a specific sheet by name file_path = 'my_data.xlsx' # Create the DataFrame df = pd.read_excel(file_path,sheet_name='Sheet1') # View the first 5 rows of the DataFrame print(df.head())

Handing Categorial text data

In Pandas workflows, LabelEncoder is a tool that is used to convert categorical text data into numerical labels (integers). This is a critical preprocessing step because string values cannot be processed by most of the machine learning algorithms and hence require numerical input for training and prediction. We can achieve this using LabelEncoder from sklearn. Let's see an example below.

Code Example 6: #Dataframe with categotical values for gender import pandas as pd from sklearn.preprocessing import LabelEncoder dict={'name':["Vinny","Tracy","Mike","John"], 'gender':['M', 'F', 'M', 'M'], 'weight':[160, 145, 148, 151] } df = pd.DataFrame(dict) le = LabelEncoder() df['gender']=le.fit_transform(df['gender']) print(df) #Output: name gender weight 0 Vinny 1 160 1 Tracy 0 145 2 Mike 1 148 3 John 1 151

Dropping a column in DataFrame

To drop a column in DataFrame can be done by the following:

df = df.drop(columns=['col1', 'col2'])

By default drop() returns a new DataFrame. To modify the original object directly, use inplace=True.

df = df.drop(columns=['col1', 'col2'], inplace=True)

Code Example 7: import pandas as pd dict={'name':["Vinny","Tracy","Mike","John"], 'gender':['M', 'F', 'M', 'M'], 'weight':[160, 115, 148, 151], 'age':[36, 38, 43, 35] } df = pd.DataFrame(dict) updated_df = df.drop(columns=['gender', 'weight']) print(updated_df) #Output: name age 0 Vinny 36 1 Tracy 38 2 Mike 43 3 John 35 #use inplace=True to modify the original DataFrame directly: df = df.drop(columns=['gender', 'weight'], inplace=True) print(df) #Output: name age 0 Vinny 36 1 Tracy 38 2 Mike 43 3 John 35

Code Example 8: #Convert to Unix Timestamp (Nanoseconds), Then divide by 1000000000. import pandas as pd dict={"dat": ["2024 03 01 19:23:15.887", "2024 03 02 11:23:15.887"]} df=pd.DataFrame(dict) df['numeric_date'] = pd.to_datetime(df['dat']).astype(int) // 1000000000 print(df) #Output: dat numeric_date 0 2024 03 01 19:23:15.887 1709320995 1 2024 03 02 11:23:15.887 1709378595 Code Example 9: Convert to YYYYMMDD Integer dict={"dat": ["2024 03 01 19:23:15.887", "2024 03 02 11:23:15.887"]} df=pd.DataFrame(dict) df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int) print(df) #Output: dat date_int 0 2024 03 01 19:23:15.887 20240301 1 2024 03 02 11:23:15.887 20240302 Code Example 10: Convert to YYYYMMDD Integer import pandas as pd dict={"dat": ["12/28/1974", "8/8/1980"]} df=pd.DataFrame(dict) df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int) print(df) #Output: dat date_int 0 12/28/1974 19741228 1 8/8/1980 19800808

Code Example 11: import pandas as pd # Create a sample DataFrame data = {'c1': ['A', 'B', 'A', 'C', 'B'], 'c2': ['X', 'Y', 'X', 'Z', 'Y'], 'o1': [10, 20, 30, 40, 50], 'o2': [1.1, 2.2, 3.3, 4.4, 5.5], 'x1':[4,6,7,8,2], 'y1':[7,8,3,9,1] } df1=pd.DataFrame(data) df2=df1[['x1','y1']] df3=df1['o1'] df_final = pd.concat([df2, df3], axis=1) print(df_final) #Output: x1 y1 o1 0 4 7 10 1 6 8 20 2 7 3 30 3 8 9 40 4 2 1 50

Code Example 12: import pandas as pd # Create a sample DataFrame data = {'c1': [1, 5, -4, 5, 7], 'c2': [4, 5, 7, 9, 1], 'c3': [10, -3, 5, 4, 7], 'c4': [1, 2, 9, 4, 5], } df=pd.DataFrame(data) #Filter for only rows where a specific column's #value is greater than or equal to zero. df = df[df['c1'] >= 0] df = df[df['c3'] >= 0] print(df) #Output: c1 c2 c3 c4 0 1 4 10 1 3 5 9 4 4 4 7 1 7 5 Code Example 13: #Replace negative values with 0 import pandas as pd # Create a sample DataFrame data = {'c1': [1, 5, -4, 5, 7], 'c2': [4, 5, 7, 9, 1], 'c3': [10, -3, 5, 4, 7], 'c4': [1, 2, 9, 4, 5], } df=pd.DataFrame(data) df['c1'] = df['c1'].clip(lower=0) df['c3'] = df['c3'].clip(lower=0) print(df) #Output: c1 c2 c3 c4 0 1 4 10 1 1 5 5 0 2 2 0 7 5 9 3 5 9 4 4 4 7 1 7 5

Code Example 14: import pandas as pd import numpy as np # Create a sample DataFrame data = {'c1': [1, 5, " ", 5, 7], 'c2': [4, 5, 7, 9, 1], 'c3': [10, -3, 5, 4, 7], 'c4': [1, 23, '', 4, 5.], } df=pd.DataFrame(data) df=df.replace(r'^\s*$', np.nan, regex=True).dropna() #If you want Daframe to have only integers, then use astype(int) df = df.astype(int) print(df) #Output: c1 c2 c3 c4 0 1 4 10 1 1 5 5 -3 23 3 5 9 4 4 4 7 1 7 5

Code Example 15: import pandas as pd df = pd.read_excel('file_data.xlsx') # Drop rows where at least one element is missing df_cleaned = df.dropna() # Drop rows only if all columns are missing df_cleaned_all = df.dropna(how='all') df_cleaned.to_excel('cleaned_file.xlsx', index=False)

Code Example 16: import pandas as pd # Creating a small sample DataFrame data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 'Score': [81, 91, 78, 92, 88, 75, 95]} df = pd.DataFrame(data) # 1. Using head() to see the first 5 rows (default) print("--- First 5 Rows (head) ---") print(df.head()) #Output: Name Score 0 A 81 1 B 91 2 C 78 3 D 92 4 E 88 # 2. Using head(n) to see a custom number of top rows print("\n--- First 3 Rows (head(3)) ---") print(df.head(3)) #Output: Name Score 0 A 81 1 B 91 2 C 78 # 3. Using tail() to see the last 5 rows (default) print("\n--- Last 5 Rows (tail) ---") print(df.tail()) #Output: Name Score 2 C 78 3 D 92 4 E 88 5 F 75 6 G 95 # 4. Using tail(n) to see a custom number of bottom rows print("\n--- Last 2 Rows (tail(2)) ---") print(df.tail(2)) #Output: Name Score 5 F 75 6 G 95

Pandas

Create DataFrame

Create Dataframe from Excel sheet

Handing Categorial text data

Dropping a column in DataFrame

Convert a date to a numeric format

Append different DataFrames

Handling negative values in DataFrames

How to remove spaces or "" in the DataFrame

Remove rows with missing data in Excel files.

Head() and Tail() method in Pandas.

Test