Pandas is one of the most powerful and widely used Python libraries for data analysis and data manipulation. It provides easy-to-use data structures and functions needed to work efficiently with structured data. The two most important data structures in Pandas are the Series and the DataFrame.
1. Introduction to Pandas
Pandas is built on top of NumPy and is designed to handle tabular data, time series, and heterogeneous datasets efficiently.
Key Features:
- Easy handling of missing data
- Powerful grouping and aggregation
- Data alignment and indexing
- Integration with other libraries like Matplotlib and Scikit-learn
2. Understanding Pandas Series
A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, etc.).
Creating a Series
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
dtype: int64
With Custom Index
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
Key Characteristics of Series:
- One-dimensional
- Contains values and index
- Supports vectorized operations
Accessing Data
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s['a']) # Access by label
print(s.iloc[0]) # Access by position
Output:
10
10
Operations on Series
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
add = s + 10
print(add)
mul = s * 2
print(mul)
Output:
a 20
b 30
c 40
d 50
dtype: int64
a 20
b 40
c 60
d 80
dtype: int64
These operations are element-wise.
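Element-wise operations also work between two Series, in which case pandas matches values by index label rather than by position. The following is a minimal sketch; the two Series and their labels are made up for illustration.

```python
import pandas as pd

# Two illustrative Series whose labels only partly overlap
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Addition aligns on index labels; labels found in only one Series give NaN
total = s1 + s2
print(total)

# The add() method accepts fill_value to treat a missing label as 0
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

Because the result can contain NaN, it is stored as floating point even though both inputs hold integers.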
3. Understanding Pandas DataFrame
A DataFrame is a two-dimensional labeled data structure with rows and columns, similar to a table or spreadsheet.
Creating a DataFrame
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 John 25
1 Alice 30
2 Bob 22
Key Characteristics:
- Two-dimensional
- Columns can have different data types
- Labeled axes (rows and columns)
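These characteristics can be inspected directly on a DataFrame. The sketch below reuses the Name and Age columns from above and adds a hypothetical Height column (not part of the original example) so that three different data types appear.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 22],
    'Height': [1.80, 1.65, 1.75],  # hypothetical column added for illustration
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one data type per column
print(df.index)    # row labels
print(df.columns)  # column labels
```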
4. Accessing Data in DataFrame
Column Selection
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
col = df['Name']
print(col)
Output:
0 John
1 Alice
2 Bob
Name: Name, dtype: str
Multiple Columns
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
mul_col = df[['Name', 'Age']]
print(mul_col)
Output:
Name Age
0 John 25
1 Alice 30
2 Bob 22
Row Selection
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
by_label = df.loc[0] # By label
print(by_label)
by_row = df.iloc[0] # By integer position
print(by_row)
Output:
Name John
Age 25
Name: 0, dtype: object
Name John
Age 25
Name: 0, dtype: object
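Both calls return the same row here only because the default index labels (0, 1, 2) happen to match the row positions. A short sketch with a hypothetical non-default index shows where loc and iloc differ.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]},
    index=[10, 20, 30],  # hypothetical labels that differ from positions
)

row_by_label = df.loc[10]     # the row whose index label is 10
row_by_position = df.iloc[0]  # the first row, whatever its label
print(row_by_label.equals(row_by_position))

last_row = df.iloc[2]         # third row by position (label 30)
print(last_row)

# df.loc[0] would raise KeyError here because no row carries the label 0
```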
5. DataFrame Operations
Adding a New Column
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
df['Salary'] = [50000, 60000, 45000] # Adding a new column
print(df)
Output:
Name Age Salary
0 John 25 50000
1 Alice 30 60000
2 Bob 22 45000
Deleting a Column
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
df.drop('Age', axis=1, inplace=True) # Deleting a column
print(df)
Output:
Name
0 John
1 Alice
2 Bob
Filtering Data
import pandas as pd
data = {
'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 22]
}
df = pd.DataFrame(data)
filtered = df[df['Age'] > 25] # Filtering
print(filtered)
Output:
Name Age
1 Alice 30
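Several conditions can be combined in one filter. The sketch below uses the same DataFrame; the variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]})

# Wrap each condition in parentheses; use & for "and", | for "or", ~ for "not"
between = df[(df['Age'] >= 22) & (df['Age'] < 30)]
either = df[(df['Age'] > 25) | (df['Name'] == 'Bob')]
not_john = df[~(df['Name'] == 'John')]

print(between)
print(either)
print(not_john)
```

Python's plain `and` and `or` keywords do not work on these boolean Series, which is why the `&` and `|` operators are used.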
6. Handling Missing Data
Pandas marks missing numeric values as NaN and provides isnull(), dropna(), and fillna() to detect, remove, and replace them.
Sample DataFrame with Missing Values
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
'Age': [25, np.nan, 30, np.nan],
'Marks': [80, 90, np.nan, 70]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Marks
0 Amit 25.0 80.0
1 Rahul NaN 90.0
2 Sita 30.0 NaN
3 Geeta NaN 70.0
1. df.isnull(): Detect Missing Values
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
'Age': [25, np.nan, 30, np.nan],
'Marks': [80, 90, np.nan, 70]
}
df = pd.DataFrame(data)
print(df.isnull())
Output:
Name Age Marks
0 False False False
1 False True False
2 False False True
3 False True False
2. df.dropna(): Remove Rows with Missing Values
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
'Age': [25, np.nan, 30, np.nan],
'Marks': [80, 90, np.nan, 70]
}
df = pd.DataFrame(data)
print(df.dropna())
Output:
Name Age Marks
0 Amit 25.0 80.0
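By default dropna() discards every row that has at least one missing value, which leaves a single row here. As a sketch of two less drastic options, the subset and axis parameters restrict what is checked or dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
    'Age': [25, np.nan, 30, np.nan],
    'Marks': [80, 90, np.nan, 70],
})

# Drop rows only when Age is missing; a missing Marks value is tolerated
has_age = df.dropna(subset=['Age'])
print(has_age)

# Drop columns (axis=1) that contain any missing value instead of rows
complete_columns = df.dropna(axis=1)
print(complete_columns)
```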
3. df.fillna(0): Replace Missing Values
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
'Age': [25, np.nan, 30, np.nan],
'Marks': [80, 90, np.nan, 70]
}
df = pd.DataFrame(data)
print(df.fillna(0))
Output:
Name Age Marks
0 Amit 25.0 80.0
1 Rahul 0.0 90.0
2 Sita 30.0 0.0
3 Geeta 0.0 70.0
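Filling with 0 is not always meaningful; an age of 0, for example, distorts averages. One common alternative, sketched here as an assumption about what suits this data, is to fill each column with its own mean by passing a dictionary to fillna().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
    'Age': [25, np.nan, 30, np.nan],
    'Marks': [80, 90, np.nan, 70],
})

# Map each column name to the value used for its missing entries
filled = df.fillna({
    'Age': df['Age'].mean(),      # mean() skips NaN by default
    'Marks': df['Marks'].mean(),
})
print(filled)
```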
7. Indexing and Selection
Setting Index
df.set_index('Name', inplace=True)
Resetting Index
df.reset_index(inplace=True)
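The two calls above modify df in place. A self-contained sketch of the same idea, using the non-mutating form and variable names chosen for illustration, shows what the index change makes possible.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]})

# Without inplace=True, set_index returns a new DataFrame
indexed = df.set_index('Name')

# Rows can now be looked up by name
alice_age = indexed.loc['Alice', 'Age']
print(alice_age)

# reset_index turns the index back into an ordinary column
restored = indexed.reset_index()
print(restored)
```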
8. Data Aggregation and Grouping
Grouping splits the data into groups based on the values in one or more columns, applies a function to each group, and combines the results.
import pandas as pd
data = {
'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
'Age': [25, 25, 30, 40],
'Marks': [80, 90, 90, 70]
}
df = pd.DataFrame(data)
group_by_age_sum = df.groupby('Age').sum()
print(group_by_age_sum)
Output:
Name Marks
Age
25 AmitRahul 170
30 Sita 90
40 Geeta 70
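Note that sum() concatenates the strings in the Name column. To aggregate only numbers, select the numeric column first, for example df.groupby('Age')['Marks'].sum().
Common aggregation functions: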
- sum()
- mean()
- count()
- min(), max()
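Several of these functions can be applied in one call with agg(). The sketch below groups the same data by Age and aggregates only the numeric Marks column.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Rahul', 'Sita', 'Geeta'],
    'Age': [25, 25, 30, 40],
    'Marks': [80, 90, 90, 70],
})

# One row per Age group, one column per aggregation function
summary = df.groupby('Age')['Marks'].agg(['sum', 'mean', 'count', 'min', 'max'])
print(summary)
```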
9. Reading and Writing Data
Reading Files
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')
Writing Files
df.to_csv('output.csv')
df.to_excel('output.xlsx')
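The snippets above assume that data.csv and data.xlsx already exist. As a self-contained sketch, the round trip below writes to an in-memory text buffer instead of a file on disk, so it runs without any input files. Passing index=False keeps the row labels out of the CSV.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]})

# Write CSV text into an in-memory buffer (stands in for a file path)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Rewind the buffer and read the same text back into a DataFrame
buffer.seek(0)
loaded = pd.read_csv(buffer)
print(loaded)
```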
10. Difference Between Series and DataFrame
| Feature | Series | DataFrame |
| --- | --- | --- |
| Dimension | One-dimensional | Two-dimensional |
| Structure | Single column | Multiple columns |
| Data Types | Homogeneous or mixed | Mixed across columns |
| Use Case | Simple data | Complex datasets |
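The relationship between the two structures shows up when selecting columns: single brackets return a Series, while double brackets return a DataFrame, even for one column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]})

age_series = df['Age']    # single brackets: a one-dimensional Series
age_frame = df[['Age']]   # double brackets: a two-dimensional DataFrame

print(type(age_series).__name__, age_series.ndim, age_series.shape)
print(type(age_frame).__name__, age_frame.ndim, age_frame.shape)
```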
Understanding Pandas Series and DataFrames is essential for anyone working in data science, machine learning, or data analysis. A Series provides a simple way to handle single-dimensional data, while a DataFrame offers a powerful structure for handling complex, tabular datasets.