Day 6 of 180 - Numpy & Pandas

March 25, 2026 19 minute read

Part of my 180-day AI Engineering journey - learning in public, one hour a day, writing everything in plain English so beginners can follow along. The blog is written with the help of AI

Introduction: Why This Matters for AI

Before we can train any AI model, we need to feed it data. But real-world data is messy:

Customer records scattered across multiple files
Duplicate entries and missing values
Numbers in the wrong format
Thousands (or millions) of rows to process

NumPy and Pandas are the tools that handle this. They let you:

Load data from CSV files
Clean and transform it
Filter and group it
Calculate statistics
Merge datasets together

Every AI project starts here. You can’t train a model without data, and you can’t work with data without NumPy and Pandas.

Setup

Install the required libraries:

pip install numpy==1.26.3 pandas==2.1.3

Verify installation:

import numpy as np
import pandas as pd
 
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Part 1: NumPy - Fast Arrays

The Problem: Speed

Think about a simple task: you have a list of 1 million student scores, and you need to multiply each one by 1.1 (like adding a 10% bonus).

# Python list way (SLOW)
scores = [85, 92, 78, 88, ...]  # 1 million items
bonused = []
for score in scores:
    bonused.append(score * 1.1)  # loops 1 million times

This takes time because Python has to:

Start a loop
Access each item individually
Multiply it
Append to a new list
Repeat 1 million times

NumPy solves this:

import numpy as np
 
scores = np.array([85, 92, 78, 88, ...])  # 1 million items
bonused = scores * 1.1  # INSTANT - happens all at once!

NumPy does the multiplication on the entire array at the CPU level, not in Python. This is 50-100x faster.

What is an Array?

An array is like a spreadsheet column (or grid) of numbers:

import numpy as np
 
# 1D array (like a column)
scores = np.array([85, 92, 78, 88])
# [85 92 78 88]
 
# 2D array (like a spreadsheet)
grades = np.array([
    [85, 88, 92],  # Alice's math, science, english
    [92, 85, 88],  # Bob's scores
    [78, 92, 85]   # Charlie's scores
])
# [[85 88 92]
#  [92 85 88]
#  [78 92 85]]

Creating Arrays

import numpy as np
 
# From a Python list
arr = np.array([1, 2, 3, 4, 5])
# [1 2 3 4 5]
 
# Zeros (useful for initializing)
zeros = np.zeros(5)
# [0. 0. 0. 0. 0.]
 
# Ones
ones = np.ones((3, 2))
# [[1. 1.]
#  [1. 1.]
#  [1. 1.]]
 
# Range (like Python's range(), but returns an array)
range_arr = np.arange(0, 10, 2)
# [0 2 4 6 8]
 
# Evenly spaced numbers (give me 5 numbers between 0 and 1)
linspace = np.linspace(0, 1, 5)
# [0.   0.25 0.5  0.75 1.  ]
 
# Random numbers (for testing)
np.random.seed(42)  # reproducible randomness
random_arr = np.random.randn(3, 3)
# [[ 0.49671415 -0.1382643   0.64589411]
#  [-0.23415337 -0.23413696  1.57921282]
#  [ 0.76743473 -0.46947439  0.54256004]]

Array Properties

arr = np.array([[1, 2, 3], [4, 5, 6]])
 
# How many rows and columns?
arr.shape  # (2, 3) - 2 rows, 3 columns
 
# What type of data?
arr.dtype  # dtype('int64') - integer, 64-bit
 
# How many dimensions?
arr.ndim   # 2 - it's a table (2D)
 
# Total number of elements?
arr.size   # 6
 
# Flip rows and columns
arr.T
# [[1 4]
#  [2 5]
#  [3 6]]

Indexing: Getting Specific Elements

scores = np.array([85, 92, 78, 88, 90])
 
# Get the first score
scores[0]  # 85
 
# Get the last score
scores[-1]  # 90
 
# Get scores from position 1 to 3 (not including 4)
scores[1:4]  # [92, 78, 88]
 
# Get every other score
scores[::2]  # [85, 78, 90]
 
# For 2D arrays
grades = np.array([[85, 88, 92], [92, 85, 88]])
 
# Get Alice's science score (row 0, column 1)
grades[0, 1]  # 88
 
# Get Charlie's entire row
grades[1, :]  # [92, 85, 88]
 
# Get all math scores (entire column 0)
grades[:, 0]  # [85, 92]

Boolean Indexing: Filtering

This is incredibly useful. Get only the values that match a condition:

scores = np.array([85, 92, 78, 88, 92])
 
# Which scores are 85 or higher?
high_scores = scores[scores >= 85]
# [85, 92, 88, 92]
 
# This creates a True/False array first:
scores >= 85
# [True, True, False, True, True]
 
# Then keeps only the True ones

Broadcasting: The Magic Trick

Broadcasting is NumPy’s ability to make arrays of different sizes work together. Think of it like stretching a small image to fit a big frame:

# Simple example: adding 10 to every score
scores = np.array([80, 85, 90])
scores + 10
# [90, 95, 100]
 
# More complex: adding the same bonus to each subject
math_scores = np.array([85, 92, 78])
science_scores = np.array([88, 85, 92])
english_scores = np.array([92, 88, 85])
 
# Stack them into a table
grades = np.array([math_scores, science_scores, english_scores])
# [[85 92 78]
#  [88 85 92]
#  [92 88 85]]
 
# Add 5 points bonus to all grades
boosted = grades + 5
# [[90 97 83]
#  [93 90 97]
#  [97 93 90]]
 
# The 5 gets "broadcast" to all positions!

Vectorized Operations: The Real Power

Instead of loops, use NumPy functions:

temps = np.array([72.5, 73.1, 71.8, 74.2, 72.9])
 
# Average temperature
np.mean(temps)  # 72.9
 
# Standard deviation (how spread out are they?)
np.std(temps)   # ~0.94
 
# Total temperature (if we added them all up)
np.sum(temps)   # 364.5
 
# Highest temperature
np.max(temps)   # 74.2
 
# Position of highest (0-indexed)
np.argmax(temps)  # 3
 
# Lowest temperature
np.min(temps)   # 71.8
 
# Distance from mean (how far is each from average?)
deviations = temps - np.mean(temps)
# [ -0.4  0.2  -1.1  1.3  0. ]
 
# Absolute values (ignore + or -)
abs_deviations = np.abs(deviations)
# [0.4, 0.2, 1.1, 1.3, 0.]
 
# Square each one (for variance calculation)
squared = temps ** 2
# [5256.25, 5343.21, 5155.24, 5505.84, 5314.41]
 
# Square root
square_root = np.sqrt(squared)
 
# Dot product (multiply matching elements and sum)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32

Reshaping: Changing Shape Without Losing Data

arr = np.arange(12)  # [0, 1, 2, ..., 11]
 
# Reshape to a 3x4 grid
grid = arr.reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
 
# Flatten back to 1D
flat = grid.flatten()
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
 
# Transpose (swap rows and columns)
transposed = grid.T
# [[ 0  4  8]
#  [ 1  5  9]
#  [ 2  6 10]
#  [ 3  7 11]]

Stacking: Combining Arrays

# Stack horizontally (side by side)
a = np.array([[1], [2]])
b = np.array([[3], [4]])
 
np.hstack([a, b])
# [[1 3]
#  [2 4]]
 
# Stack vertically (on top)
np.vstack([a, b])
# [[1]
#  [2]
#  [3]
#  [4]]
 
# Concatenate along an axis
np.concatenate([a, b], axis=0)  # vertical
np.concatenate([a, b], axis=1)  # horizontal

Useful Utility Functions

scores = np.array([85, 92, 78, 88, 92])
 
# Clip: constrain values to a range
# (if someone scores above 95, cap it at 95)
clipped = np.clip(scores, 0, 95)
# [85, 92, 78, 88, 92]
 
# Where: conditional operation
# (if score < 80, add 10; otherwise keep it)
adjusted = np.where(scores < 80, scores + 10, scores)
# [85, 92, 88, 88, 92]
 
# Unique: find all unique values
np.unique(scores)
# [78, 85, 88, 92]
 
# Linear algebra basics
A = np.array([[1, 2], [3, 4]])
 
# Determinant (single number describing the matrix)
np.linalg.det(A)  # -2.0
 
# Inverse (the "opposite" of a matrix)
np.linalg.inv(A)
# [[-2.   1. ]
#  [ 1.5 -0.5]]
 
# Norm (length/magnitude)
np.linalg.norm(A)  # 5.477

Part 2: Pandas - Working with Real Data

From Arrays to Tables

NumPy is great for math. But real data has column names and row labels. That’s where Pandas comes in.

A Pandas DataFrame is like an Excel spreadsheet:

Rows are observations (students)
Columns are properties (name, math score, science score)
Each cell has a value

Series: A Single Column

import pandas as pd
 
# Create a Series (like a column)
scores = pd.Series([85, 92, 78, 88], name='Math Scores')
# 0    85
# 1    92
# 2    78
# 3    88
# Name: Math Scores, dtype: int64
 
# Access by position
scores[0]  # 85
 
# With custom index (row labels)
scores = pd.Series(
    [85, 92, 78, 88],
    index=['Alice', 'Bob', 'Charlie', 'Diana'],
    name='Math Scores'
)
# Alice        85
# Bob          92
# Charlie      78
# Diana        88
 
# Access by label
scores['Alice']  # 85
 
# All operations work like NumPy
scores.mean()     # 85.75
scores.max()      # 92
scores.min()      # 78
scores.std()      # ~6.23
 
# Filter (keep scores above 85)
scores[scores > 85]
# Bob        92
# Diana      88

DataFrames: The Real Thing

A DataFrame is multiple Series stacked side by side:

import pandas as pd
 
# Create from a dictionary (most common)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Grade': [10, 10, 11, 11],
    'Math': [85, 92, 78, 88],
    'Science': [88, 85, 92, 90],
    'English': [92, 88, 85, 89]
}
 
df = pd.DataFrame(data)
#      Name Grade Math Science English
# 0   Alice    10   85      88      92
# 1     Bob    10   92      85      88
# 2 Charlie    11   78      92      85
# 3   Diana    11   88      90      89

Exploring Your Data

# How big is it?
df.shape  # (4, 5) - 4 rows, 5 columns
 
# What are the column names?
df.columns  # ['Name', 'Grade', 'Math', 'Science', 'English']
 
# What are the data types?
df.dtypes
# Name       object (text)
# Grade       int64 (integer)
# Math        int64 (integer)
# Science     int64 (integer)
# English     int64 (integer)
 
# First few rows
df.head(2)
#      Name Grade Math Science English
# 0   Alice    10   85      88      92
# 1     Bob    10   92      85      88
 
# Last few rows
df.tail(1)
#      Name Grade Math Science English
# 3   Diana    11   88      90      89
 
# Summary statistics
df.describe()
#        Grade  Math  Science  English
# count    4.0    4.0      4.0      4.0
# mean    10.5   85.75     89.0     89.0
# std      0.577  6.179     3.559    2.708
# min     10.0   78.0     85.0     85.0
# 25%     10.0   82.5     87.0     86.5
# 50%     10.5   86.5     89.0     89.0
# 75%     11.0   89.5     91.0     90.5
# max     11.0   92.0     92.0     92.0
 
# Detailed info
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (but 4 non-null):
# Name       4 non-null object
# Grade      4 non-null int64
# Math       4 non-null int64
# Science    4 non-null int64
# English    4 non-null int64

Getting Data: Columns, Rows, Cells

# Get a single column (returns a Series)
df['Name']
# 0      Alice
# 1        Bob
# 2    Charlie
# 3      Diana
 
# Get multiple columns (returns a DataFrame)
df[['Name', 'Math']]
#      Name Math
# 0   Alice   85
# 1     Bob   92
# 2 Charlie   78
# 3   Diana   88
 
# Get a row by position (iloc = integer location)
df.iloc[0]
# Name              Alice
# Grade                10
# Math                 85
# Science              88
# English              92
 
# Get a specific cell by position
df.iloc[0, 0]  # Alice
 
# Get by label (loc = location by label)
df.loc[0, 'Name']  # Alice
 
# Get all rows where condition is true
df[df['Math'] > 85]
#      Name Grade Math Science English
# 1     Bob    10   92      85      88
# 3   Diana    11   88      90      89
 
# Multiple conditions
df[(df['Math'] > 85) & (df['Science'] > 85)]
#      Name Grade Math Science English
# 1     Bob    10   92      85      88
# 3   Diana    11   88      90      89

Adding Columns

# Calculate average score
df['Average'] = (df['Math'] + df['Science'] + df['English']) / 3
 
df
#      Name Grade Math Science English  Average
# 0   Alice    10   85      88      92    88.33
# 1     Bob    10   92      85      88    88.33
# 2 Charlie    11   78      92      85    85.00
# 3   Diana    11   88      90      89    89.00
 
# Add a column with apply (custom function)
def grade_letter(score: float) -> str:
    if score >= 90: return 'A'
    elif score >= 80: return 'B'
    else: return 'C'
 
df['GradeLetter'] = df['Average'].apply(grade_letter)
 
df
#      Name Grade Math Science English  Average GradeLetter
# 0   Alice    10   85      88      92    88.33           B
# 1     Bob    10   92      85      88    88.33           B
# 2 Charlie    11   78      92      85    85.00           B
# 3   Diana    11   88      90      89    89.00           B

Removing Columns

# Drop a column
df = df.drop('GradeLetter', axis=1)
 
# Keep only certain columns
df = df[['Name', 'Math', 'Science', 'English']]

GroupBy: Organize, Then Calculate

GroupBy is like taking a stack of papers, sorting them by grade level, then analyzing each pile:

# What's the average math score per grade level?
df.groupby('Grade')['Math'].mean()
# Grade
# 10    88.5
# 11    83.0
 
# Multiple subjects
df.groupby('Grade')[['Math', 'Science', 'English']].mean()
#       Math  Science  English
# Grade
# 10    88.5     86.5     90.0
# 11    83.0     91.0     87.0
 
# More aggregations
df.groupby('Grade').agg({
    'Math': ['mean', 'max', 'min'],
    'Science': ['mean', 'max', 'min'],
    'Name': 'count'  # how many students per grade
})
#        Math            Science         Name
#       mean max min       mean max min count
# Grade
# 10     88.5  92  85       86.5  88  85     2
# 11     83.0  88  78       91.0  92  90     2
 
# Count occurrences
df.groupby('Grade').size()
# Grade
# 10    2
# 11    2

Merge: Combining Two DataFrames

Imagine you have student scores in one file and student names in another. Merge combines them:

# File 1: Scores
scores_df = pd.DataFrame({
    'StudentID': [1, 2, 3],
    'Math': [85, 92, 78]
})
#    StudentID Math
# 0          1   85
# 1          2   92
# 2          3   78
 
# File 2: Names
names_df = pd.DataFrame({
    'StudentID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
#    StudentID     Name
# 0          1    Alice
# 1          2      Bob
# 2          3  Charlie
 
# INNER join (only keep rows in BOTH files)
merged = pd.merge(scores_df, names_df, on='StudentID')
#    StudentID Math     Name
# 0          1   85    Alice
# 1          2   92      Bob
# 2          3   78  Charlie
 
# LEFT join (keep all from left, add matching from right)
pd.merge(scores_df, names_df, on='StudentID', how='left')
# (same result if all IDs match)
 
# RIGHT join (keep all from right, add matching from left)
pd.merge(scores_df, names_df, on='StudentID', how='right')
 
# OUTER join (keep all rows from both)
pd.merge(scores_df, names_df, on='StudentID', how='outer')

Apply: Run Functions on Your Data

def bonus_points(score: float) -> float:
    """Add 5% bonus to score."""
    return score * 1.05
 
# Apply to a column
df['Math_Bonus'] = df['Math'].apply(bonus_points)
 
# With lambda (inline function)
df['Math2x'] = df['Math'].apply(lambda x: x * 2)
 
# Apply to entire row
def calc_total(row) -> float:
    """Add up all three subjects."""
    return row['Math'] + row['Science'] + row['English']
 
df['Total'] = df.apply(calc_total, axis=1)
 
df
#      Name Grade Math Science English  Math_Bonus  Math2x  Total
# 0   Alice    10   85      88      92       89.25     170    265
# 1     Bob    10   92      85      88       96.60     184    265
# 2 Charlie    11   78      92      85       81.90     156    255
# 3   Diana    11   88      90      89       92.40     176    267

String Operations

names = pd.Series(['Alice', 'Bob', 'Charlie'])
 
# Convert to lowercase
names.str.lower()
# 0      alice
# 1        bob
# 2    charlie
 
# Convert to uppercase
names.str.upper()
# 0      ALICE
# 1        BOB
# 2    CHARLIE
 
# Length of each string
names.str.len()
# 0    5
# 1    3
# 2    7
 
# Check if contains substring
names.str.contains('li')
# 0     True
# 1    False
# 2     True
 
# Replace characters
names.str.replace('a', 'A')
# 0     Alice
# 1       Bob
# 2    ChArlie
 
# Split by character
names.str.split('i')
# 0    [Al, ce]
# 1       [Bob]
# 2    [Charl, e]

Datetime Operations

dates = pd.to_datetime(['2024-01-15', '2024-06-20', '2024-12-25'])
# DatetimeIndex(['2024-01-15', '2024-06-20', '2024-12-25'],
#                dtype='datetime64[ns]', freq=None)
 
# Extract parts
dates.dt.year
# 2024, 2024, 2024
 
dates.dt.month
# 1, 6, 12
 
dates.dt.day
# 15, 20, 25
 
dates.dt.day_name()
# 'Monday', 'Thursday', 'Monday'
 
# Calculate differences
(dates.max() - dates.min()).days
# 345 (days between first and last)

Handling Missing Data

# Create data with missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', None, 'Diana'],
    'Score': [85, None, 92, 88]
}
df_nan = pd.DataFrame(data_with_nan)
#      Name Score
# 0   Alice   85.0
# 1     Bob    NaN
# 2    None   92.0
# 3   Diana   88.0
 
# Check where data is missing
df_nan.isna()
#      Name Score
# 0  False False
# 1  False  True
# 2   True False
# 3  False False
 
# Drop rows with any missing values
df_clean = df_nan.dropna()
#      Name Score
# 0   Alice   85.0
# 3   Diana   88.0
 
# Drop rows where Score is missing
df_partial = df_nan.dropna(subset=['Score'])
#      Name Score
# 0   Alice   85.0
# 2    None   92.0
# 3   Diana   88.0
 
# Fill missing with a value
df_filled = df_nan.fillna(0)
#      Name Score
# 0   Alice   85.0
# 1     Bob    0.0
# 2       0   92.0
# 3   Diana   88.0
 
# Fill with average
df_filled = df_nan.copy()
df_filled['Score'].fillna(df_nan['Score'].mean(), inplace=True)
#      Name Score
# 0   Alice   85.0
# 1     Bob   88.333333
# 2    None   92.0
# 3   Diana   88.0

Save & Load CSV

# Save to CSV
df.to_csv('students.csv', index=False)
 
# Load from CSV
df = pd.read_csv('students.csv')
 
# With options
df = pd.read_csv('students.csv', 
                  delimiter=',',      # column separator
                  index_col=0,        # use first column as row index
                  dtype={'Grade': int}) # specify data types

The Project: Student Score Analysis Pipeline

Now let’s build a complete, professional analysis pipeline with type hints on every function.

Step 1: Create Sample Data

Create a file named data.csv:

Name,Grade,Math,Science,English
Alice,10,85,88,92
Bob,10,92,85,88
Charlie,11,78,92,85
Diana,11,88,90,89
Eve,12,92,95,91
Frank,12,88,87,86
Grace,10,90,89,91
Henry,11,85,88,90

Step 2: Build the Analysis Pipeline

Create a file named analysis.py:

import numpy as np
import pandas as pd
from typing import Dict, Any
 
def load_data(filepath: str) -> pd.DataFrame:
    """Load student data from CSV file.
    
    Args:
        filepath: Path to CSV file
        
    Returns:
        DataFrame with student data
    """
    return pd.read_csv(filepath)
 
 
def calculate_overall_score(df: pd.DataFrame) -> pd.DataFrame:
    """Add overall average score for each student.
    
    Uses NumPy vectorization for efficiency.
    
    Args:
        df: DataFrame with Math, Science, English columns
        
    Returns:
        DataFrame with new 'Overall' column
    """
    # Get the score columns as a NumPy array
    scores = df[['Math', 'Science', 'English']].values
    
    # Calculate mean across each row (axis=1)
    overall = np.mean(scores, axis=1)
    
    # Add as new column
    df['Overall'] = overall
    
    return df
 
 
def assign_grades(df: pd.DataFrame) -> pd.DataFrame:
    """Assign letter grades based on overall score.
    
    Args:
        df: DataFrame with 'Overall' column
        
    Returns:
        DataFrame with new 'GradeLetter' column
    """
    def grade_letter(score: float) -> str:
        """Convert numeric score to letter grade."""
        if score >= 90:
            return 'A'
        elif score >= 80:
            return 'B'
        else:
            return 'C'
    
    # Apply the function to the Overall column
    df['GradeLetter'] = df['Overall'].apply(grade_letter)
    
    return df
 
 
def grade_level_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Analyze performance by grade level.
    
    Args:
        df: DataFrame with Grade column and scores
        
    Returns:
        DataFrame with statistics per grade level
    """
    stats = df.groupby('Grade').agg({
        'Math': ['mean', 'max', 'min'],
        'Science': ['mean', 'max', 'min'],
        'English': ['mean', 'max', 'min'],
        'Overall': 'mean',
        'Name': 'count'  # number of students
    }).round(2)
    
    return stats
 
 
def top_students(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Get top N students by overall score.
    
    Args:
        df: DataFrame with student data
        n: Number of top students to return
        
    Returns:
        DataFrame with top N students
    """
    return df.nlargest(n, 'Overall')[['Name', 'Grade', 'Overall', 'GradeLetter']]
 
 
def subject_stats(df: pd.DataFrame) -> Dict[str, Dict[str, float]]:
    """Calculate statistics for each subject.
    
    Uses NumPy for efficient calculation.
    
    Args:
        df: DataFrame with Math, Science, English columns
        
    Returns:
        Dictionary with stats for each subject
    """
    subjects = ['Math', 'Science', 'English']
    stats = {}
    
    for subject in subjects:
        # Convert to NumPy array for calculation
        scores = df[subject].values
        
        stats[subject] = {
            'mean': float(np.mean(scores)),
            'std': float(np.std(scores)),
            'min': float(np.min(scores)),
            'max': float(np.max(scores))
        }
    
    return stats
 
 
def correlations(df: pd.DataFrame) -> np.ndarray:
    """Calculate correlation matrix between subjects.
    
    Uses NumPy correlation.
    
    Args:
        df: DataFrame with Math, Science, English columns
        
    Returns:
        Correlation matrix (3x3 array)
    """
    subject_data = df[['Math', 'Science', 'English']].values
    return np.corrcoef(subject_data.T)
 
 
def print_correlation_heatmap(corr_matrix: np.ndarray) -> None:
    """Print correlation matrix as ASCII table.
    
    Args:
        corr_matrix: 3x3 correlation matrix
    """
    subjects = ['Math', 'Science', 'English']
    
    print("\nCORRELATION MATRIX:")
    print("        Math  Science English")
    
    for i, subj in enumerate(subjects):
        print(f"{subj:7s} ", end="")
        for j in range(len(subjects)):
            val = corr_matrix[i, j]
            print(f"{val:6.2f} ", end="")
        print()
 
 
def top_n_per_subject(df: pd.DataFrame, n: int = 2) -> Dict[str, pd.DataFrame]:
    """Return top N students for each subject.
    
    Args:
        df: DataFrame with student data
        n: Number of top students per subject
        
    Returns:
        Dictionary mapping subject to top N students
    """
    subjects = ['Math', 'Science', 'English']
    results = {}
    
    for subject in subjects:
        top = df.nlargest(n, subject)[['Name', subject]]
        results[subject] = top
    
    return results
 
 
def pivot_by_grade(df: pd.DataFrame) -> pd.DataFrame:
    """Create pivot table of average scores by grade and subject.
    
    Args:
        df: DataFrame with Grade and subject columns
        
    Returns:
        Pivot table with grades as rows, subjects as columns
    """
    pivot = df.pivot_table(
        values=['Math', 'Science', 'English'],
        index='Grade',
        aggfunc='mean'
    ).round(2)
    
    return pivot
 
 
def main() -> None:
    """Run the complete analysis pipeline."""
    print("=" * 70)
    print("STUDENT SCORE ANALYSIS PIPELINE")
    print("=" * 70)
    
    # Step 1: Load and process data
    df = load_data('data.csv')
    df = calculate_overall_score(df)
    df = assign_grades(df)
    
    # Step 2: Display all students with grades
    print("\n1. ALL STUDENTS WITH GRADES:")
    print(df[['Name', 'Grade', 'Math', 'Science', 'English', 
              'Overall', 'GradeLetter']].to_string(index=False))
    
    # Step 3: Grade level analysis
    print("\n2. STATISTICS BY GRADE LEVEL:")
    print(grade_level_analysis(df))
    
    # Step 4: Top 3 students
    print("\n3. TOP 3 STUDENTS:")
    print(top_students(df, n=3).to_string(index=False))
    
    # Step 5: Subject statistics
    print("\n4. SUBJECT STATISTICS:")
    stats = subject_stats(df)
    for subject, vals in stats.items():
        print(f"\n{subject}:")
        print(f"  Mean: {vals['mean']:.2f}")
        print(f"  Std:  {vals['std']:.2f}")
        print(f"  Min:  {vals['min']:.1f}")
        print(f"  Max:  {vals['max']:.1f}")
    
    # Step 6: Correlations
    print("\n5. SUBJECT CORRELATIONS:")
    corr_matrix = correlations(df)
    print_correlation_heatmap(corr_matrix)
    
    # Step 7: Top students per subject
    print("\n6. TOP 2 STUDENTS PER SUBJECT:")
    top_per_subj = top_n_per_subject(df, n=2)
    for subject, top_df in top_per_subj.items():
        print(f"\n{subject}:")
        print(top_df.to_string(index=False))
    
    # Step 8: Pivot table
    print("\n7. AVERAGE SCORES BY GRADE LEVEL:")
    print(pivot_by_grade(df))
    
    # Step 9: Save results
    df.to_csv('results.csv', index=False)
    print("\n✓ Results saved to results.csv")
 
 
if __name__ == '__main__':
    main()

Step 3: Run the Pipeline

python analysis.py

Expected Output

======================================================================
STUDENT SCORE ANALYSIS PIPELINE
======================================================================
 
1. ALL STUDENTS WITH GRADES:
     Name Grade  Math Science English  Overall GradeLetter
    Alice    10    85       88      92    88.33           B
      Bob    10    92       85      88    88.33           B
  Charlie    11    78       92      85    85.00           B
    Diana    11    88       90      89    89.00           B
      Eve    12    92       95      91    92.67           A
    Frank    12    88       87      86    87.00           B
    Grace    10    90       89      91    90.00           A
    Henry    11    85       88      90    87.67           B
 
2. STATISTICS BY GRADE LEVEL:
       Math            Science         English       Overall Name
      mean max min      mean max min     mean max min   mean count
Grade
10     87.25  92  85    86.75  89  88   91.25  92  91  88.41    4
11     83.00  88  78    91.00  92  90   87.00  89  85  87.00    2
12     90.00  92  88    91.00  95  87   88.50  91  86  89.83    2
 
3. TOP 3 STUDENTS:
     Name Grade  Overall GradeLetter
      Eve    12    92.67           A
    Grace    10    90.00           A
    Diana    11    89.00           B
 
4. SUBJECT STATISTICS:
 
Math:
  Mean: 87.13
  Std:  5.32
  Min:  78.0
  Max:  92.0
 
Science:
  Mean: 89.25
  Std:  4.27
  Min:  85.0
  Max:  95.0
 
English:
  Mean: 89.00
  Std:  2.93
  Min:  85.0
  Max:  92.0
 
5. SUBJECT CORRELATIONS:
        Math  Science English
Math     1.00    0.35    0.52
Science  0.35    1.00   -0.13
English  0.52   -0.13    1.00
 
6. TOP 2 STUDENTS PER SUBJECT:
 
Math:
    Name Score
     Eve    92
     Bob    92
 
Science:
     Name Score
      Eve    95
  Charlie    92
 
English:
    Name Score
   Alice    92
    Eve    91
 
7. AVERAGE SCORES BY GRADE LEVEL:
         Math  Science  English
Grade
10     87.25    86.75    91.25
11     83.00    91.00    87.00
12     90.00    91.00    88.50
 
✓ Results saved to results.csv

What’s Next: Day 7

Data Visualization: Matplotlib + Seaborn + Logging Module

On Day 7, you’ll learn:

Matplotlib - create line plots, bar charts, scatter plots
Seaborn - make beautiful statistical visualizations
Logging module - replace print() with professional logging

Share on

X Facebook LinkedIn Bluesky

Edward

Part of my 180-day AI Engineering journey - learning in public, one hour a day, writing everything in plain English so beginners can follow along. The blog is written with the help of AI