Essential EDA Techniques for Data Scientists

Sheary Tan
Oct 27, 2025

You know what separates junior data scientists from senior ones? It’s not the machine learning algorithms they know or the neural networks they can build. It’s how they approach a new dataset.

Senior data scientists don’t rush to modeling. They spend time, real time, exploring, questioning, and understanding their data first. They know that a mediocre model built on deeply understood data will always outperform a sophisticated model built on data nobody truly explored.

This guide covers the essential EDA techniques every data scientist needs to master, the fundamental skills that turn raw datasets into insights and prevent costly mistakes down the line.


Why EDA Matters More Than You Think

Let’s talk about the real cost of skipping EDA.

Netflix doesn’t just recommend movies because they have fancy algorithms. They spend enormous time understanding viewing patterns, completion rates, rewatch behavior, and temporal trends.

Uber doesn’t forecast demand by throwing data into a model; they analyze ride patterns, geographic hotspots, time-of-day variations, and seasonal fluctuations first.

Tesla’s Autopilot processes sensor data through layers of exploratory analysis before any autonomous decision gets made.

These companies understand something crucial: insights come from understanding data, not just processing it.

When you skip EDA, here’s what happens:

  • You build models on biased samples without realizing it
  • Outliers corrupt your model parameters
  • Missing data patterns get ignored, introducing systemic errors
  • Feature relationships remain hidden, leaving predictive power on the table
  • Data quality issues propagate through your entire pipeline

EDA is the real work.


The Essential EDA Toolkit

Before we get into techniques, let’s establish the foundation. You need four core libraries that form the backbone of Python data analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration for cleaner outputs
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# For Jupyter notebooks
%matplotlib inline

Pandas is your data manipulation workhorse. Every dataset becomes a DataFrame, and most of your time is spent transforming, filtering, and aggregating with Pandas operations.

NumPy powers the numerical operations behind the scenes; whenever you calculate means, standard deviations, or correlations, NumPy is doing the heavy lifting. Matplotlib gives you low-level control over visualizations. Need precise control over every element? Matplotlib. Seaborn makes beautiful statistical graphics with less code; it’s built on Matplotlib but optimized for data exploration.

These four libraries cover 90% of what you’ll do during EDA. The remaining 10% involves specialized tools we’ll cover later.


Phase 1: Initial Data Assessment

Every EDA journey starts the same way: load your data and take that crucial first look.

# Load your dataset
df = pd.read_csv('your_data.csv')

# First impressions
print(df.head(10))  # Look at first 10 rows
print(df.tail(10))  # Check the end too (different patterns sometimes)
print(f"\nDataset shape: {df.shape[0]} rows, {df.shape[1]} columns")

But you’re not just running commands, you’re actively looking for signals:

  • Do column names make sense? Cryptic names like “col_a_23” or “unnamed_7” suggest poor data documentation. Rename them now before confusion multiplies (see the cleanup sketch after this list).
  • What’s the data granularity? Does each row represent a customer, a transaction, a time period? Misunderstanding this corrupts everything downstream.
  • Are there obvious data types? Numbers, dates, categories, text? Your entire analysis strategy depends on correctly identifying these.
  • Initial red flags? Placeholder values like 999999, missing data showing as “N/A” or blank spaces, dates formatted as text?
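
These first-pass fixes are quick to script. Below is a small cleanup sketch; the column names, the 999999 sentinel, and the date column are hypothetical placeholders you would swap for whatever your own dataset shows:

# First-pass cleanup sketch (column names and placeholder values are assumptions)
df = df.rename(columns={'col_a_23': 'signup_channel', 'unnamed_7': 'region'})

# Treat sentinel values as missing so they don't distort later statistics
df = df.replace({999999: np.nan, 'N/A': np.nan, '': np.nan})

# Parse date-like text columns early (column name is an assumption)
if 'signup_date' in df.columns:
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')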

Understanding Data Structure

# Comprehensive overview
print(df.info())

# Memory usage (important for large datasets)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Column data types
print("\nData types:")
print(df.dtypes.value_counts())

The .info() output is criminally underused. It shows you:

  • Non-null counts (revealing missing data immediately)
  • Data types (catching type mismatches)
  • Memory usage (identifying inefficiencies)

If you’re expecting 10,000 rows but a column shows 7,500 non-null entries, you’ve got 25% missing data. That’s not trivial, that needs investigation.
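
Turning that check into code takes two lines. A small sketch that flags columns whose completeness falls below an arbitrary 95% cutoff:

# Flag columns whose non-null share falls below a threshold (the 95% cutoff is arbitrary)
completeness = df.notnull().mean()
print(completeness[completeness < 0.95].sort_values())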

Statistical Snapshot

# Numerical columns summary
print(df.describe())

# Include percentiles for more detail
print(df.describe(percentiles=[.05, .25, .5, .75, .95]))

# Categorical columns
print(df.describe(include=['object']))

The .describe() method gives you the statistical skeleton of your numerical data. But here’s what you’re actually checking:

  • Does the mean make sense? If average age is 250, something’s wrong.
  • Are min/max values reasonable? Negative prices, impossibly high counts, dates in year 3050, these are data quality issues.
  • How spread out is your data? A large standard deviation relative to the mean suggests high variability or outliers.
  • Do quartiles reveal skewness? If the median (50th percentile) is much different from the mean, your distribution is skewed.
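
These checks are easy to script. Here is a rough sanity-check sketch; the 'age' column and its plausible range are assumptions you would replace with your own domain rules:

# Rough sanity checks: compare mean vs median and scan min/max per numerical column
numerical_cols = df.select_dtypes(include=[np.number]).columns

checks = pd.DataFrame({
    'mean': df[numerical_cols].mean(),
    'median': df[numerical_cols].median(),
    'min': df[numerical_cols].min(),
    'max': df[numerical_cols].max(),
    'skew': df[numerical_cols].skew(),
})
# Large mean-median gaps hint at skew or outliers
checks['mean_median_gap'] = (checks['mean'] - checks['median']).abs()
print(checks.sort_values('mean_median_gap', ascending=False))

# Example domain-rule check (column name and range are assumptions)
if 'age' in df.columns:
    print(f"Implausible ages: {((df['age'] < 0) | (df['age'] > 120)).sum()}")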

Phase 2: Data Quality Assessment

Clean data is rare. Most datasets have missing values, duplicates, inconsistencies, and errors. Finding them is essential.

Missing Data Analysis

Missing data isn’t just annoying, it’s often informative. The pattern of missingness tells you something about your data collection process or the underlying phenomenon.

# Count and percentage of missing values
missing = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Percentage': (df.isnull().sum() / len(df) * 100).values
})

# Show only columns with missing data
missing = missing[missing['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
print(missing)

# Visualize missing data patterns
plt.figure(figsize=(14, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern')
plt.show()

Ask critical questions:

  • Is missing data random or systematic?
  • Do certain rows have many missing values (data collection issue)?
  • Do certain columns have many missing values (variable not consistently recorded)?
  • Are there correlations in missingness (when column A is missing, column B is often missing too)?

These patterns determine how you handle missing data later. Random missingness is different from systematic missingness.
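
The last question, correlated missingness, is straightforward to check. A small sketch that correlates missing-value indicators between columns:

# Correlate missing-value indicators to spot columns that tend to go missing together
missing_flags = df.isnull().astype(int)
missing_flags = missing_flags.loc[:, missing_flags.sum() > 0]  # only columns with missing data

if missing_flags.shape[1] > 1:
    plt.figure(figsize=(10, 8))
    sns.heatmap(missing_flags.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
    plt.title('Correlation of Missingness Between Columns')
    plt.show()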

Duplicate Detection

Duplicates corrupt statistics, inflate counts, and break models.

# Complete duplicate rows
duplicates = df.duplicated().sum()
print(f"Complete duplicates: {duplicates}")

# View them
if duplicates > 0:
    print(df[df.duplicated(keep=False)].sort_values(by=df.columns.tolist()))

# Duplicates based on specific columns
# (e.g., customer_id should be unique)
if 'customer_id' in df.columns:
    id_duplicates = df.duplicated(subset=['customer_id']).sum()
    print(f"Duplicate customer IDs: {id_duplicates}")
    
# Find duplicates with tolerance (fuzzy matching)
# Useful for names or addresses
from fuzzywuzzy import fuzz
# Implementation depends on specific use case

Sometimes duplicates are obvious (identical rows). Sometimes they’re subtle (same entity with slightly different spelling or timestamps). Both need addressing.
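
To make the fuzzy-matching idea concrete, here is a minimal sketch using fuzzywuzzy, assuming a hypothetical 'customer_name' column; pairwise comparison is O(n²), so limit it to a small or pre-filtered set of values:

from itertools import combinations
from fuzzywuzzy import fuzz

# Pairwise fuzzy comparison of names (column name is an assumption; O(n^2), so sample first)
names = df['customer_name'].dropna().unique()[:500]

for a, b in combinations(names, 2):
    score = fuzz.token_sort_ratio(a, b)
    if score >= 90:  # high similarity suggests the same entity spelled differently
        print(f"{score}: '{a}'  <->  '{b}'")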

Data Type Validation

Pandas infers data types, but it often guesses wrong.

# Check what should be numerical but isn't
for col in df.select_dtypes(include=['object']).columns:
    try:
        pd.to_numeric(df[col])
        print(f"{col} can be converted to numeric")
    except ValueError as e:
        print(f"{col} contains non-numeric values: {df[col].unique()[:5]}")

# Convert when appropriate
df['price'] = pd.to_numeric(df['price'].str.replace(',', ''), errors='coerce')

# Convert dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Check conversion success
print(f"Failed date conversions: {df['date'].isnull().sum()}")

This catches issues like:

  • Numbers stored as strings (with commas or currency symbols)
  • Dates as text instead of datetime objects
  • Categories as text when they should be categorical type
  • Integers stored as floats unnecessarily
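
For the categorical case, the fix is a one-line conversion. A small sketch (the column name is an assumption) that also shows the memory effect:

# Convert a low-cardinality text column to the categorical dtype (column name is an assumption)
if 'department' in df.columns:
    before = df['department'].memory_usage(deep=True)
    df['department'] = df['department'].astype('category')
    after = df['department'].memory_usage(deep=True)
    print(f"Memory: {before / 1024:.1f} KB -> {after / 1024:.1f} KB")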

Phase 3: Univariate Analysis

Now we explore individual variables to understand their distributions and characteristics.

Numerical Variables

For continuous data, you care about distribution shape, central tendency, spread, and outliers.

# For each numerical column
numerical_cols = df.select_dtypes(include=[np.number]).columns

for col in numerical_cols:
    fig, axes = plt.subplots(1, 4, figsize=(20, 4))
    
    # Histogram
    axes[0].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[0].set_title(f'{col} - Histogram')
    axes[0].set_xlabel(col)
    axes[0].set_ylabel('Frequency')
    
    # Box plot
    axes[1].boxplot(df[col].dropna())
    axes[1].set_title(f'{col} - Box Plot')
    axes[1].set_ylabel(col)
    
    # KDE plot
    df[col].dropna().plot(kind='kde', ax=axes[2])
    axes[2].set_title(f'{col} - KDE Plot')
    axes[2].set_xlabel(col)
    
    # Q-Q plot
    from scipy import stats
    stats.probplot(df[col].dropna(), dist="norm", plot=axes[3])
    axes[3].set_title(f'{col} - Q-Q Plot')
    
    plt.tight_layout()
    plt.show()
    
    # Statistical summary
    print(f"\n{col} Statistics:")
    print(f"Mean: {df[col].mean():.2f}")
    print(f"Median: {df[col].median():.2f}")
    print(f"Std Dev: {df[col].std():.2f}")
    print(f"Skewness: {df[col].skew():.2f}")
    print(f"Kurtosis: {df[col].kurtosis():.2f}")
    print(f"Range: [{df[col].min():.2f}, {df[col].max():.2f}]")

What you’re learning:

  • Distribution shape: Normal (bell curve), skewed (long tail one direction), bimodal (two peaks), uniform (flat)?
  • Skewness: Positive skew = long right tail (common with income, house prices). Negative skew = long left tail.
  • Kurtosis: High values = heavy tails (more outliers). Low values = light tails.
  • Outliers: Box plot whiskers show you immediately. Those dots beyond? Investigate them.
  • Q-Q plot: Points following the diagonal line = normally distributed. Deviations = non-normal distribution.
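
To see these shape statistics across every column at once, a compact summary sketch:

# Compact shape summary across all numerical columns
shape_summary = pd.DataFrame({
    'skewness': df[numerical_cols].skew(),
    'kurtosis': df[numerical_cols].kurtosis(),
}).sort_values('skewness', ascending=False)
print(shape_summary)

# Columns with |skew| > 1 are common candidates for a log or power transform
print(shape_summary[shape_summary['skewness'].abs() > 1])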

Categorical Variables

For categorical data, you care about frequency distributions and balance.

# For each categorical column
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

for col in categorical_cols:
    print(f"\n{col} Value Counts:")
    print(df[col].value_counts())
    print(f"\nProportions:")
    print(df[col].value_counts(normalize=True))
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Bar plot
    df[col].value_counts().plot(kind='bar', ax=axes[0])
    axes[0].set_title(f'{col} Distribution')
    axes[0].set_xlabel(col)
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Pie chart (if categories aren't too many)
    if df[col].nunique() <= 10:
        df[col].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
        axes[1].set_title(f'{col} Proportions')
        axes[1].set_ylabel('')
    
    plt.tight_layout()
    plt.show()
    
    # Cardinality check
    print(f"Unique values: {df[col].nunique()}")
    if df[col].nunique() > 50:
        print("⚠️ High cardinality - consider grouping")

Watch for:

  • Class imbalance: If predicting fraud and 99.9% of transactions are legitimate, standard models will fail.
  • High cardinality: 500 unique product types might need grouping into categories (see the grouping sketch after this list).
  • Unexpected values: “Unknown”, “N/A”, “Other” dominating suggests data quality issues.
  • Dominant categories: One category representing 90%+ of data might indicate a problem or severely limit model learning.
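
For the high-cardinality case, one common remedy is folding rare categories into an “Other” bucket. A sketch, where the column name and the 1% cutoff are assumptions:

# Fold rare categories into 'Other' (column name and 1% threshold are assumptions)
col = 'product_type'
if col in df.columns:
    freq = df[col].value_counts(normalize=True)
    rare = freq[freq < 0.01].index
    df[col + '_grouped'] = df[col].where(~df[col].isin(rare), other='Other')
    print(df[col + '_grouped'].value_counts())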

Phase 4: Bivariate Analysis

This is where insights emerge. You’re uncovering relationships between variables.

Numerical vs Numerical

Relationships between continuous variables reveal patterns, correlations, and potential causation.

# Scatter plot matrix for key variables
key_vars = ['age', 'income', 'spending', 'satisfaction']  # Adjust to your data
sns.pairplot(df[key_vars], diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()

# Correlation analysis
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, 
            annot=True, 
            fmt='.2f',
            cmap='coolwarm',
            center=0,
            square=True,
            linewidths=1,
            cbar_kws={'shrink': 0.8})
plt.title('Correlation Heatmap', fontsize=16, pad=20)
plt.show()

# Find highly correlated pairs
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr.append({
                'Variable 1': correlation_matrix.columns[i],
                'Variable 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if high_corr:
    print("\nHighly Correlated Pairs (|r| > 0.7):")
    print(pd.DataFrame(high_corr))

Key insights:

  • Correlation ≠ causation: Just because two variables move together doesn’t mean one causes the other.
  • Multicollinearity warning: High correlation between independent variables causes problems in regression models.
  • Non-linear relationships: Correlation only captures linear relationships. A strong curved relationship might show weak correlation.
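
A cheap way to catch the non-linear case is to compare Pearson with Spearman (rank) correlation. A sketch where large gaps between the two flag relationships that are monotonic but not linear:

# Compare Pearson (linear) and Spearman (rank-based, monotonic) correlation matrices
pearson = df[numerical_cols].corr(method='pearson')
spearman = df[numerical_cols].corr(method='spearman')

# Large gaps suggest relationships that are monotonic but not linear
gap = (spearman - pearson).abs()
# unstack to pairs, sort, and drop the mirrored duplicates from the symmetric matrix
print(gap.unstack().sort_values(ascending=False).drop_duplicates().head(10))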

Categorical vs Numerical

How do numerical distributions differ across categories?

# Box plots by category
numerical_col = 'salary'
categorical_col = 'department'

plt.figure(figsize=(12, 6))
sns.boxplot(x=categorical_col, y=numerical_col, data=df)
plt.title(f'{numerical_col} by {categorical_col}')
plt.xticks(rotation=45)
plt.show()

# Violin plots (show full distribution)
plt.figure(figsize=(12, 6))
sns.violinplot(x=categorical_col, y=numerical_col, data=df)
plt.title(f'{numerical_col} Distribution by {categorical_col}')
plt.xticks(rotation=45)
plt.show()

# Statistical test for differences
from scipy.stats import f_oneway

groups = [df[df[categorical_col] == cat][numerical_col].dropna() 
          for cat in df[categorical_col].unique()]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("✓ Significant differences between groups")
else:
    print("✗ No significant differences between groups")

This reveals whether groups differ meaningfully. Engineering salaries centered at $100K vs marketing at $70K is actionable information.
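
A numeric companion to the plots and the ANOVA, a quick per-group summary sketch using the same column variables:

# Per-group summary statistics to quantify what the plots show
group_summary = df.groupby(categorical_col)[numerical_col].agg(['count', 'mean', 'median', 'std'])
print(group_summary.sort_values('median', ascending=False))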

Categorical vs Categorical

Relationships between categorical variables show associations and dependencies.

# Cross-tabulation
crosstab = pd.crosstab(df['gender'], df['department'])
print("Cross-tabulation:")
print(crosstab)

# Normalized (proportions)
crosstab_norm = pd.crosstab(df['gender'], df['department'], normalize='index')
print("\nRow-wise proportions:")
print(crosstab_norm)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Stacked bar chart
crosstab.plot(kind='bar', stacked=True, ax=axes[0])
axes[0].set_title('Department by Gender (Counts)')
axes[0].set_ylabel('Count')
axes[0].legend(title='Department')

# Grouped bar chart
crosstab.plot(kind='bar', ax=axes[1])
axes[1].set_title('Department by Gender (Side-by-side)')
axes[1].set_ylabel('Count')
axes[1].legend(title='Department')

plt.tight_layout()
plt.show()

# Chi-square test for independence
from scipy.stats import chi2_contingency

chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"\nChi-square statistic: {chi2:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("✓ Variables are dependent")
else:
    print("✗ Variables are independent")

Phase 5: Outlier Detection and Treatment

Outliers are data points that deviate significantly from other observations. But not all outliers are errors, some are legitimate extreme values.

Detection Methods

def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    return outliers, lower_bound, upper_bound

def detect_outliers_zscore(data, column, threshold=3):
    """Detect outliers using Z-score method"""
    from scipy import stats
    col_data = data[column].dropna()
    # Compute z-scores on the non-null values, then map the mask back to the original index
    z_scores = np.abs(stats.zscore(col_data))
    return data.loc[col_data.index[z_scores > threshold]]

# Apply to a column
col = 'income'
iqr_outliers, lower, upper = detect_outliers_iqr(df, col)
zscore_outliers = detect_outliers_zscore(df, col)

print(f"IQR method: {len(iqr_outliers)} outliers (bounds: [{lower:.2f}, {upper:.2f}])")
print(f"Z-score method: {len(zscore_outliers)} outliers")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Box plot with outliers highlighted
axes[0].boxplot(df[col].dropna())
axes[0].set_title(f'{col} - Box Plot')
axes[0].set_ylabel(col)

# Scatter plot with outliers colored
axes[1].scatter(range(len(df)), df[col], alpha=0.5, label='Normal')
axes[1].scatter(iqr_outliers.index, iqr_outliers[col], 
                color='red', alpha=0.7, label='Outliers')
axes[1].set_title(f'{col} - Outliers Highlighted')
axes[1].set_xlabel('Index')
axes[1].set_ylabel(col)
axes[1].legend()

plt.tight_layout()
plt.show()

Treatment Strategies

What you do with outliers depends on whether they’re errors or legitimate extremes.

If errors: Remove or impute with reasonable values.

If legitimate: Keep but use robust methods (median instead of mean), transform the data (log transformation), or cap extreme values (winsorization).

# Option 1: Remove outliers
df_no_outliers = df[~df.index.isin(iqr_outliers.index)]

# Option 2: Cap values (winsorization)
from scipy.stats.mstats import winsorize
df['income_capped'] = winsorize(df['income'], limits=[0.05, 0.05])

# Option 3: Log transformation (for right-skewed data)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros

# Compare distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

df['income'].hist(bins=30, ax=axes[0])
axes[0].set_title('Original')

df['income_capped'].hist(bins=30, ax=axes[1])
axes[1].set_title('Capped')

df['log_income'].hist(bins=30, ax=axes[2])
axes[2].set_title('Log Transformed')

plt.tight_layout()
plt.show()

Phase 6: Feature Engineering During EDA

EDA often reveals opportunities for better features, derived variables that capture relationships more clearly.

# Create age groups from continuous age
df['age_group'] = pd.cut(df['age'], 
                          bins=[0, 18, 35, 50, 65, 100],
                          labels=['<18', '18-35', '36-50', '51-65', '65+'])

# Extract datetime features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter

# Interaction features
df['price_per_sqft'] = df['price'] / df['square_feet']
df['income_to_age_ratio'] = df['income'] / df['age']

# Polynomial features
df['age_squared'] = df['age'] ** 2

# Binning continuous variables
df['income_bracket'] = pd.qcut(df['income'], 
                                 q=5, 
                                 labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Encoding categorical variables
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True, prefix='cat')

# Target encoding (use with caution - risk of leakage)
category_means = df.groupby('category')['target'].mean()
df['category_target_encoded'] = df['category'].map(category_means)

During EDA, constantly ask: “What features would make relationships clearer?” or “How can I represent this information better?”
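
One quick way to answer those questions is to check whether an engineered feature actually separates an outcome you care about. A sketch, assuming a numeric 'target' column exists:

# Check whether engineered bins separate the outcome ('target' column is an assumption)
if 'target' in df.columns:
    print(df.groupby('income_bracket')['target'].agg(['count', 'mean']))
    print(df.groupby('age_group')['target'].agg(['count', 'mean']))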


How Livedocs Transforms EDA Workflows

Everything we’ve covered represents the traditional approach: write code, generate outputs, interpret results. It works, and understanding these fundamentals is essential for any data scientist.

But here’s where modern tools like Livedocs fundamentally change the game.

The AI-Assisted EDA Experience

Instead of writing pandas code to explore your data, imagine working like this:

You: “Show me the distribution of customer ages and flag any outliers”

AI: Generates code, executes it, presents histogram and box plot with outliers highlighted

You: “How does purchase amount correlate with customer tenure? Break it down by customer segment.”

AI: Creates correlation analysis, generates scatter plots colored by segment, calculates segment-specific correlations

You: “Are there any concerning patterns in the missing data?”

AI: Analyzes missingness, creates heatmap, identifies that data is systematically missing for certain customer types

This isn’t just faster, it’s qualitatively different. You’re having a conversation with your data instead of translating analytical thoughts into code syntax.

Context-Aware AI That Knows Your Data

This is where it gets interesting. Livedocs’ AI doesn’t just help you write code, it understands your specific data schema.

Connect your PostgreSQL database or upload your CSV files. The AI knows your tables, columns, relationships, and data types. When you ask “What are our highest-value customers?” it knows which tables to query and how to define “value” based on your schema.

You: "Compare Q4 revenue to Q3 by product category, but only for categories with >$100K in sales"

AI: *Understands your date column, revenue column, and category column*
     *Writes SQL with proper joins and filters*
     *Executes query*
     *Generates comparative visualization*
     *All in seconds*

Traditional approach: write SQL, run query, export to pandas, create visualization, format chart. Takes minutes.

Livedocs approach: ask question, get answer. Takes seconds.


Bringing It All Together: The Complete EDA Workflow

Here’s what a modern, comprehensive EDA process looks like:

# Phase 1: Initial Assessment
df = pd.read_csv('data.csv')
print(df.head())
print(df.info())
print(df.describe())

# Phase 2: Automated Overview
from ydata_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file('quick_profile.html')

# Phase 3: Data Quality
print("Missing data:\n", df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())

# Phase 4: Type Validation and Conversion
numerical = df.select_dtypes(include=[np.number]).columns
categorical = df.select_dtypes(include=['object']).columns

# Phase 5: Univariate Analysis
for col in numerical:
    # Distribution analysis with visualizations
    # (code from Phase 3 section)
    pass

# Phase 6: Bivariate Analysis
correlation = df[numerical].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()

# Phase 7: Outlier Detection
for col in numerical:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    print(f"{col}: {len(outliers)} outliers")

# Phase 8: Feature Engineering
# Create derived features based on insights

# Phase 9: Document Findings
# Create visualizations for communication
# Write interpretations and next steps

Or in Livedocs:

"Give me a comprehensive overview of this dataset including data quality issues, distributions, correlations, and outliers"

[AI generates complete analysis]

"Now compare how these patterns differ between customer segments"

[AI breaks down analysis by segments]

"What features should we engineer for predicting churn?"

[AI suggests features based on relationships found]

Exploratory Data Analysis is where data science begins. Not with fancy algorithms or complex models, but with simple questions: What’s in my data? What patterns exist? What problems need solving?

The techniques in this guide (understanding distributions, detecting outliers, exploring relationships, assessing quality) are fundamental skills every data scientist must master. They’re not the flashy parts of the job, but they’re the foundation everything else builds upon.


Final Thoughts

Master these essentials. Practice them until they become intuitive. Then push further, develop domain expertise, learn advanced techniques, explore new tools.

But never skip the fundamentals. No matter how sophisticated your models become, they’re only as good as your understanding of the data going into them.

And that understanding? It starts with exploration.

And Livedocs is the best tool to get started. Try out Livedocs now.

  • 8x faster responses
  • Ask the agent to find datasets for you
  • Set system rules for the agent
  • Collaborate
  • And more

Get started with Livedocs and build your first live notebook in minutes.


  • 💬 If you have questions or feedback, please email directly at a[at]livedocs[dot]com
  • 📣 Take Livedocs for a spin over at livedocs.com/start. Livedocs has a great free plan, with $5 per month of LLM usage on every plan
  • 🤝 Say hello to the team on X and LinkedIn

Stay tuned for the next tutorial!
