Stroke Risk Analysis in Nigeria: A Beginner's Data Journey

Introduction

Ever looked at a dataset and wondered, "What story is hiding in these numbers?" That's exactly what happened when I found stroke data from Nigeria on Hugging Face. As a data analytics beginner, I wanted to tackle something real – understanding what puts people at risk for strokes.

This article walks through how I went from a messy CSV to an interactive Tableau dashboard, the challenges I faced, and what I learned along the way.

The Project Goal

Build an interactive dashboard analyzing stroke risk factors in Nigerian patients, focusing on:

  • Medical conditions (hypertension, heart disease)

  • Demographics (age, gender, location)

  • Lifestyle factors (work type, smoking, marital status)

The Result: Six visualizations that tell a compelling story about stroke risk.

The Dataset

Source: Hugging Face
Records: Several thousand Nigerian patients
Columns:

  • Demographics: Age, Gender, Residence (Urban/Rural), Work Type

  • Medical: Hypertension, Heart Disease, Stroke (all binary 0/1)

  • Health Metrics: BMI, Average Glucose Level

  • Lifestyle: Smoking Status, Ever Married

The Challenge: Binary columns stored as 0/1 needed special handling for meaningful visualizations.

The Workflow

Step 1: Data Cleaning with Pandas

Started in a Jupyter Notebook because it's perfect for experimenting.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('stroke_data.csv')

# Quick exploration
print(df.head())
print(df.info())
print(df.isnull().sum())

# Handle missing values
df = df.dropna()

# Ensure binary columns are integers
# (assumes each column already holds 0/1; text such as 'Yes'/'No' must be mapped to 1/0 first)
binary_cols = ['stroke', 'hypertension', 'heart_disease', 'ever_married']
for col in binary_cols:
    df[col] = df[col].astype(int)

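# Create age groups for better visualization
# include_lowest=True keeps an age of exactly 0 in the first bin; ages above 100 become NaN
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 65, 100],
                         labels=['Child', 'Young Adult', 'Middle Age', 'Senior', 'Elderly'],
                         include_lowest=True)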

# Export cleaned data
df.to_csv('stroke_data_cleaned.csv', index=False)
```

Key Cleaning Steps:

  • Removed rows with missing values

  • Converted text columns to appropriate types

  • Created age categories for analysis

  • Verified data integrity
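
The integrity check is not shown in the notebook code above, so here is a minimal sketch of what it could look like. The helper name `check_cleaned` and the 0-120 age range are my own assumptions, and the sample frame is made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def check_cleaned(df, binary_cols):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # After dropna() there should be no missing values left.
    if df.isnull().any().any():
        problems.append("missing values remain")
    # Binary columns should contain only 0 and 1.
    for col in binary_cols:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0/1")
    # Assumed plausibility range for age; adjust to suit the data.
    if "age" in df.columns and not df["age"].between(0, 120).all():
        problems.append("age outside 0-120")
    return problems

# Tiny made-up frame with two deliberate errors (not the real dataset).
sample = pd.DataFrame({"age": [34, 67, 150], "stroke": [0, 1, 2]})
print(check_cleaned(sample, ["stroke"]))
```

Running a helper like this right before `to_csv` would catch a stray value before it reaches Tableau.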

Step 2: Into Tableau

Loading Data:

  1. Open Tableau Public

  2. Connect to the cleaned CSV

  3. Critical: Verify data types in the Data Source tab

       • Binary columns (0/1) should show as numbers (#)

       • If they show as text (Abc), convert them to a number type

Creating Calculated Fields:

This is where the magic happens. You can't just drag 0/1 columns around – you need to calculate meaningful metrics.
```
Stroke Rate (%):
AVG([Stroke]) * 100
```

Why this works: Averaging 0s and 1s gives you the proportion of 1s (stroke cases). Multiplying by 100 turns that proportion into a percentage.
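
The same arithmetic can be checked outside Tableau. This is a minimal sketch in plain Python, using ten made-up 0/1 values rather than anything from the real dataset:

```python
# Toy stroke flags for ten hypothetical patients: 1 = stroke, 0 = no stroke.
stroke_flags = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Averaging 0s and 1s is the same as counting the 1s and dividing by the total.
proportion = sum(stroke_flags) / len(stroke_flags)
stroke_rate_pct = proportion * 100

print(f"{sum(stroke_flags)} strokes out of {len(stroke_flags)} patients = {stroke_rate_pct:.1f}%")
```

Two strokes in ten patients comes out as a 20% rate, which is what `AVG([Stroke]) * 100` would report for the same rows.
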
```
Risk Score:
[Hypertension] + [Heart Disease] + IF [BMI] > 30 THEN 1 ELSE 0 END
```

**Important:** The `IF [BMI] > 30 THEN 1 ELSE 0 END` is crucial because `[BMI] > 30` returns TRUE/FALSE, not a number. You can't add TRUE to integers.
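
For cross-checking the Tableau numbers back in the notebook, the Risk Score can be sketched in pandas too. The rows below are hypothetical, and the column names assume the lowercase names used in the cleaning step. pandas would add booleans to integers without complaint, so the explicit cast is there only to mirror what Tableau requires.

```python
import pandas as pd

# Hypothetical rows for illustration only (not from the dataset).
patients = pd.DataFrame({
    "hypertension":  [0, 1, 1, 0],
    "heart_disease": [0, 0, 1, 1],
    "bmi":           [24.5, 31.2, 35.0, 29.9],
})

# (bmi > 30) is a boolean Series; astype(int) plays the role of
# Tableau's IF [BMI] > 30 THEN 1 ELSE 0 END.
high_bmi_flag = (patients["bmi"] > 30).astype(int)

patients["risk_score"] = (
    patients["hypertension"] + patients["heart_disease"] + high_bmi_flag
)
print(patients)
```
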
```
Risk Category:
IF [Risk Score] >= 3 THEN "Very High Risk"
ELSEIF [Risk Score] = 2 THEN "High Risk"
ELSEIF [Risk Score] = 1 THEN "Medium Risk"
ELSE "Low Risk"
END
```

Step 3: Building Visualizations

Visualization 1: Heatmap (Hypertension × Heart Disease)

Goal: Show how these conditions interact

Setup:

  • Columns: Hypertension (blue pill/dimension)

  • Rows: Heart Disease (blue pill/dimension)

  • Color: AVG(Stroke) as percentage

  • Mark Type: Square

  • Size: record count (for example, COUNT([Stroke])) to show sample size

Rookie Mistake: Kept getting a continuous axis instead of a 2×2 grid. Solution: Right-click the pill → Convert to Discrete.
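
A quick way to sanity-check the heatmap is to rebuild the same 2×2 grid in pandas. This is a rough sketch with made-up rows. With the real data, `df` would be loaded from the cleaned CSV instead.

```python
import pandas as pd

# Made-up rows for illustration; replace with pd.read_csv('stroke_data_cleaned.csv').
df = pd.DataFrame({
    "hypertension":  [0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
    "heart_disease": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":        [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
})

# Rows = heart disease, columns = hypertension, same layout as the Tableau heatmap.
grid = df.pivot_table(
    index="heart_disease",
    columns="hypertension",
    values="stroke",
    aggfunc=["mean", "count"],
)

stroke_rate_pct = grid["mean"] * 100   # what goes on Color
cell_counts = grid["count"]            # what goes on Size

print(stroke_rate_pct)
print(cell_counts)
```

If a cell's count is very small, its stroke rate is not worth reading much into, which is why the sample size sits on the Size shelf.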

Visualization 2: Risk Score Bar Chart

Setup:

  • Columns: Risk Category

  • Rows: AVG(Stroke) * 100

  • Color: Risk Category

  • Sort: Descending by stroke rate

  • Label: Show the percentage and the record count
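
The numbers behind this bar chart can be reproduced in pandas as a cross-check. In this sketch the `risk_category` function is my own mirror of the Tableau Risk Category field, and the scores and outcomes are invented for illustration.

```python
import pandas as pd

def risk_category(score):
    """Python mirror of the Tableau Risk Category calculated field."""
    if score >= 3:
        return "Very High Risk"
    if score == 2:
        return "High Risk"
    if score == 1:
        return "Medium Risk"
    return "Low Risk"

# Invented scores and outcomes, not real results.
df = pd.DataFrame({
    "risk_score": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3],
    "stroke":     [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})
df["risk_category"] = df["risk_score"].apply(risk_category)

# Stroke rate and patient count per category, sorted like the chart.
summary = df.groupby("risk_category")["stroke"].agg(stroke_rate="mean", patients="count")
summary["stroke_rate_pct"] = summary["stroke_rate"] * 100
summary = summary.sort_values("stroke_rate_pct", ascending=False)
print(summary[["stroke_rate_pct", "patients"]])
```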

Other Visualizations

Following the same pattern, I created:

  • Age group analysis (bar chart)

  • Gender & marriage patterns (grouped bars)

  • Work type comparison (bar chart)

  • Urban vs rural (simple comparison)

Step 4: Dashboard Assembly

Layout Strategy:

  • Top row: Medical factors (heatmap, risk score)

  • Middle row: Demographics (age, gender/marriage)

  • Bottom row: Environmental (work, location)

Design Principles:

  • Leave white space

  • Consistent color scheme

  • Clear titles and labels

  • Add context in tooltips

Key Challenges (And Solutions)

Challenge 1: "Cannot mix aggregate and non-aggregate arguments"

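Problem: Wrote a calculation that combined a row-level field (such as [Stroke]) with an aggregated one (such as AVG([Stroke])) in the same expression.

Solution: Made every term in the calculation work at the same level, either all row-level or all aggregated, for example by wrapping the row-level field in SUM() or AVG().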

Challenge 2: Green Pills vs Blue Pills

Problem: Hypertension and Heart Disease showed as green (continuous) instead of blue (discrete).

Solution: Right-click → Convert to Discrete. Or drag them from Measures to Dimensions in the data pane.

Challenge 3: Tiny Unreadable Visualizations

Problem: Charts looked fine in edit mode but tiny in dashboard.

Solution:

  • Use containers to control sizing

  • Set minimum dimensions

  • Test on different screen sizes

Challenge 4: Making It Actually Useful

The Test: For each visualization, I asked:

  1. What question does this answer?

  2. Can someone understand it in 5 seconds?

  3. Does it add new information?

If I couldn't answer all three, I deleted it (even if it looked pretty).

Key Findings

1. The Double Trouble Effect

Patients with BOTH hypertension and heart disease had dramatically higher stroke rates than those with just one. In the dashboard, the two conditions appeared to compound each other (multiplicative) rather than simply add up (additive).
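
To make "multiplicative vs additive" concrete, here is a small sketch of how the two expectations could be computed from the four cells of the heatmap. The function name is hypothetical, and the rates are invented placeholders chosen to keep the arithmetic easy. They are not findings from this dataset.

```python
def expected_joint_rates(baseline, hyp_only, heart_only):
    """Expected stroke rate for patients with both conditions under two simple models.

    additive:       each condition adds its own excess over the baseline rate
    multiplicative: each condition multiplies the baseline rate by its own ratio
    """
    additive = baseline + (hyp_only - baseline) + (heart_only - baseline)
    multiplicative = baseline * (hyp_only / baseline) * (heart_only / baseline)
    return additive, multiplicative

# Placeholder rates in percent (invented for illustration only).
baseline, hyp_only, heart_only, both = 2.0, 6.0, 8.0, 24.0

additive, multiplicative = expected_joint_rates(baseline, hyp_only, heart_only)
print(f"Additive expectation:       {additive:.1f}%")
print(f"Multiplicative expectation: {multiplicative:.1f}%")
print(f"Observed (placeholder):     {both:.1f}%")
```

If the observed rate for the "both" group sits near the multiplicative expectation and well above the additive one, the compounding description fits. With small cell counts, either comparison should be treated as a rough indication only.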

2. Age Progression

Stroke risk increases steadily with age, but younger patients (under 50) still had strokes – it's not just an "old person problem."

3. Geographic Disparities

Urban and rural areas showed different stroke rates, likely due to healthcare access, lifestyle, or detection differences.

4. Work Type Matters

Different occupations showed varying stroke rates, possibly related to stress, activity levels, and healthcare access.

5. Risk Stratification Works

Combining multiple factors into a risk score effectively identified the most vulnerable populations.

What I Learned

Technical Skills

  • Pandas: Data cleaning, type conversions, creating categories

  • Tableau: Calculated fields, different chart types, dashboard design

  • Problem-solving: Reading error messages, debugging visualizations

Data Analysis Skills

  • Asking the right questions

  • Choosing appropriate visualizations

  • Balancing detail with clarity

  • Understanding aggregations and what they mean

Tools That Saved Me

Learning:

  • Tableau Public tutorials

  • Pandas documentation

  • YouTube for specific techniques

Development:

  • Jupyter Notebook for exploration

  • Tableau Public for visualization

  • ColorBrewer for color schemes

  • Markdown for documentation

The Nigerian Context

This isn't just practice – it matters. Nigeria's healthcare system faces:

  • Limited funding (below international benchmarks)

  • High out-of-pocket payments (roughly 70-75% of health spending)

  • Geographic disparities in access

  • Underfunded primary care

Data-driven insights can help:

  • Target resources to high-risk groups

  • Guide public health campaigns

  • Inform policy decisions

  • Identify infrastructure needs

What's Next

Potential Improvements:

  • Add predictive modeling (machine learning)

  • Include temporal trends if data available

  • Build interactive web app for risk assessment

  • Expand to other cardiovascular conditions

Final Thoughts

The Secret? There isn't one. Just:

  1. Pick a project you care about

  2. Break it into tiny steps

  3. Google everything

  4. Make mistakes and learn

  5. Repeat

Your project won't be perfect. Mine isn't either. But done beats perfect, and started beats planning forever.

Useful Code Snippets

Data Cleaning Template

```python
import pandas as pd

# Load and explore
df = pd.read_csv('data.csv')
print(df.info())
print(df.isnull().sum())

# Clean
df = df.dropna()
df['column'] = df['column'].astype(int)

# Export
df.to_csv('cleaned_data.csv', index=False)
```

Tableau Calculated Fields

```
// Stroke Rate
AVG([Stroke]) * 100

// Risk Score
[Hypertension] + [Heart Disease] + IF [BMI] > 30 THEN 1 ELSE 0 END

// Boolean Conversion
IF [Condition] THEN 1 ELSE 0 END
```

Remember: Every expert was once a beginner. The only difference? They didn't give up. You've got this! 🚀

GitHub: Tracy Ouma

Tableau Dashboard: https://public.tableau.com/views/Book3_17624273398250/StrokeAnalysisinNigeria2?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

Website link: https://nigerian-stroke-insights.lovable.app