Python and Pandas for Research Data Processing: 5 Essential Operations

March 7, 2022

Pandas is the standard Python library for working with tabular data in research contexts. If you already know basic Python, five operations will handle the majority of data cleaning and analysis tasks you encounter in a typical lab.

1. Loading Data

import pandas as pd

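# Load from a CSV file
df = pd.read_csv('data.csv')
# Or load from an Excel file (reading .xlsx requires the openpyxl package)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.head()  # preview first 5 rows
df.info()  # column dtypes and non-null counts per column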

Always call df.info() immediately after loading. It shows you column names, data types, and how many non-null values each column has. Unexpected nulls or wrong dtypes cause silent errors downstream.
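To illustrate the dtype problem, here is a minimal sketch that loads a made-up CSV from an in-memory string. The column names and the 'missing' placeholder are assumptions for the example, not taken from a real dataset. A single non-numeric entry is enough for pandas to read the whole score column as text, which pd.to_numeric with errors='coerce' can repair.

```python
import io

import pandas as pd

# Hypothetical CSV content; 'missing' stands in for a hand-typed placeholder.
raw = io.StringIO(
    "subject_id,age,score\n"
    "s01,23,4.5\n"
    "s02,31,missing\n"
    "s03,19,3.8\n"
)

df = pd.read_csv(raw)

# Because of the 'missing' entry, score was read as text rather than float.
score_is_numeric_before = pd.api.types.is_numeric_dtype(df["score"])

# Convert to numbers; anything that cannot be parsed becomes NaN.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
score_is_numeric_after = pd.api.types.is_numeric_dtype(df["score"])
n_missing = int(df["score"].isna().sum())
```

If the placeholder is known in advance, passing na_values=['missing'] to read_csv should give the same result at load time.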

2. Filtering Rows and Selecting Columns

# Select one column
ages = df['age']

# Filter rows
adults = df[df['age'] >= 18]

# Multiple conditions
filtered = df[(df['age'] >= 18) & (df['group'] == 'control')]

# Select multiple columns
subset = df[['subject_id', 'age', 'score']]
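Row filtering and column selection can also be combined in one step with .loc. The sketch below uses a small made-up DataFrame (all values are invented for illustration). Note the parentheses around each condition: they are required because & binds more tightly than >= and ==.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "subject_id": ["s01", "s02", "s03", "s04"],
    "age": [17, 25, 40, 33],
    "group": ["control", "control", "treatment", "control"],
    "score": [3.1, 4.2, 5.0, 2.7],
})

# Rows: adults in the control group. Columns: only the two we need.
mask = (df["age"] >= 18) & (df["group"] == "control")
result = df.loc[mask, ["subject_id", "score"]]

# isin() matches any of several values without chaining conditions with |.
either_group = df[df["group"].isin(["control", "treatment"])]
```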

3. Handling Missing Values

# Count missing values per column
df.isnull().sum()

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing scores with the column mean (one possible strategy)
df['score'] = df['score'].fillna(df['score'].mean())

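Do not fill missing values blindly. Decide whether a null represents “not measured” or “zero” in your specific context, because the difference affects statistical interpretation. Filling with the mean also shrinks the variance of the column, so report how many values were imputed.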
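One way to keep that decision visible is to record which values were missing before filling them. The following is a sketch under assumptions: the toy data, the score_was_missing column, and the choice of a per-group mean are all illustrative, not a recommendation for any particular study.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "score": [2.0, np.nan, 6.0, 8.0],
})

# Remember which rows were originally missing before any filling.
df["score_was_missing"] = df["score"].isna()

# Fill each gap with the mean of that row's own group,
# writing to a new column so the raw values stay untouched.
group_means = df.groupby("group")["score"].transform("mean")
df["score_filled"] = df["score"].fillna(group_means)
```

Keeping the original column and the flag makes it possible to rerun the analysis with and without the imputed rows.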

4. Grouping and Aggregation

# Mean score by group
summary = df.groupby('group')['score'].mean()

# Multiple aggregations at once
summary = df.groupby('group').agg(
    mean_score=('score', 'mean'),
    n=('subject_id', 'count'),
    std_score=('score', 'std')
)
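As a worked example, the sketch below runs the same kind of named aggregation on made-up data that contains one missing score. It adds two counts to show a detail that is easy to miss: 'count' skips nulls, while 'size' counts every row in the group. The column names n_scored and n_rows are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "subject_id": ["s01", "s02", "s03", "s04", "s05"],
    "group": ["control", "control", "control", "treatment", "treatment"],
    "score": [1.0, 3.0, np.nan, 4.0, 6.0],
})

summary = df.groupby("group").agg(
    mean_score=("score", "mean"),
    n_scored=("score", "count"),  # non-null scores only
    n_rows=("score", "size"),     # all rows, including missing scores
    std_score=("score", "std"),   # sample standard deviation (ddof=1)
)
```

Here the control group has three rows but only two scored subjects, so the mean and standard deviation are based on two values.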

5. Visualizing with Matplotlib

import matplotlib.pyplot as plt

# Histogram
df['score'].hist(bins=20)
plt.xlabel('Score')
plt.ylabel('Count')
plt.title('Score Distribution')
plt.savefig('score_hist.png', dpi=150, bbox_inches='tight')
plt.show()

# Bar plot from grouped summary, drawn on a fresh figure
# so it cannot land on top of the histogram
plt.figure()
summary['mean_score'].plot(kind='bar')
plt.tight_layout()
plt.savefig('group_means.png', dpi=150)
plt.show()

Call savefig() before show(). In many environments the figure is closed once show() returns, so a later savefig() writes a blank image. For publication-quality raster figures, increase dpi to 300, or save in a vector format (PDF or SVG), for example savefig('figure.pdf').
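The same bar plot can be written with an explicit figure and axes object, which avoids any dependence on which figure is "current". This is a minimal sketch, assuming a script run without a display (hence the Agg backend) and a made-up summary table. The file names are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless scripts

import matplotlib.pyplot as plt
import pandas as pd

# Toy summary table, invented for illustration only.
summary = pd.DataFrame(
    {"mean_score": [2.0, 5.0]},
    index=["control", "treatment"],
)

fig, ax = plt.subplots(figsize=(4, 3))
summary["mean_score"].plot(kind="bar", ax=ax)  # draw on this axes explicitly
ax.set_xlabel("Group")
ax.set_ylabel("Mean score")

# Save through the figure object: once as a vector file, once as a raster file.
fig.savefig("group_means.pdf", bbox_inches="tight")
fig.savefig("group_means.png", dpi=300, bbox_inches="tight")
plt.close(fig)  # free the figure once it has been written
```

Because savefig is called on fig rather than on plt, the order relative to any show() call matters less, although saving first remains the safer habit.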