DQ Check for DataFrame - Complete Guide to Data Quality Validation
Introduction to DQ Check
DQ Check (Data Quality Check) is the process of validating data to ensure it is accurate, complete, consistent, and reliable before analysis or machine learning tasks.

In data engineering and data science projects, DataFrames (Pandas or Spark) are widely used. Performing DQ checks on DataFrames helps:
- Detect missing or invalid values
- Ensure correct data types
- Identify duplicates
- Improve ML model accuracy
- Prevent pipeline failures
Why Is a DQ Check Important?
Poor data quality leads to:
- Wrong business insights
- Poor ML model performance
- Data pipeline failures
- Incorrect reporting
A proper DQ check ensures clean, trustworthy, and usable data for analytics and AI models.
Common Data Quality Checks for DataFrame
1. Null / Missing Value Check
Pandas Example
df.isnull().sum()
Spark Example
from pyspark.sql.functions import col, sum as spark_sum
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()
Purpose: Identify missing values that can affect model accuracy.
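Once missing values are counted, they can be dropped or imputed before modeling. A minimal pandas sketch (the column names here are illustrative, not from a specific dataset):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, None, 40], "Salary": [50000.0, 60000.0, None]})

# Count missing values per column
null_counts = df.isnull().sum()

# Impute a numeric column with its median, then drop rows that still contain nulls
df["Age"] = df["Age"].fillna(df["Age"].median())
df_clean = df.dropna()
```

Median imputation is a common default for skewed numeric columns; dropping rows is safer when the affected column is critical and the row count loss is small.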
2. Duplicate Records Check
Pandas
df.duplicated().sum()
Spark
df.groupBy(df.columns).count().filter("count > 1").show()
Duplicate data leads to biased results and wrong aggregations.
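After counting duplicates, the usual fix in pandas is to keep only the first occurrence. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# duplicated() flags rows that repeat an earlier row
dup_count = df.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
df_dedup = df.drop_duplicates()
```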
3. Data Type Validation
Pandas
df.dtypes
Spark
df.printSchema()
Ensure correct types:
- Age → Integer
- Salary → Float
- Date → Timestamp
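When a column arrives with the wrong type, pandas can cast it while flagging bad entries: `errors="coerce"` turns unparseable values into NaN/NaT so they surface as null-check failures instead of raising. A sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": ["25", "30", "abc"],
    "Salary": ["50000.5", "60000", "70000"],
    "JoinDate": ["2023-01-15", "2023-06-01", "not-a-date"],
})

# Coerce invalid entries to NaN/NaT instead of raising, so they can be flagged later
df["Age"] = pd.to_numeric(df["Age"], errors="coerce").astype("Int64")
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")

bad_ages = int(df["Age"].isna().sum())
```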
4. Range Validation (Business Rules)
Example:
- Age should be between 18–60
- Salary must be greater than 0
df[(df['Age'] < 18) | (df['Age'] > 60)]
Helps detect invalid or corrupted values.
5. Unique Value Check
df['EmployeeID'].nunique()
Ensures primary keys are unique.
6. Outlier Detection
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
Detects abnormal salary or numeric values.
7. Consistency Check
df['Gender'].value_counts()
Valid: Male, Female
Invalid: M, male, FEMALE, null
Use mapping or standardization to fix inconsistencies.
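A simple standardization pattern in pandas is to normalize case and then map aliases to a canonical set; anything left unmapped becomes NaN and shows up in the null check. A sketch (the mapping values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "M", "female", "FEMALE", None]})

# Lowercase and trim, then map known aliases to canonical labels;
# unmapped values become NaN and can be caught by a null check
mapping = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}
df["Gender"] = df["Gender"].str.strip().str.lower().map(mapping)

counts = df["Gender"].value_counts()
```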
8. Referential Integrity Check
Used when multiple tables exist.
df_orders[~df_orders['cust_id'].isin(df_customers['cust_id'])]
Ensures foreign key consistency.
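An equivalent pandas approach uses `merge` with `indicator=True`, which also extends naturally to multi-column keys. A sketch with toy frames (names mirror the example above):

```python
import pandas as pd

df_customers = pd.DataFrame({"cust_id": [1, 2, 3]})
df_orders = pd.DataFrame({"order_id": [10, 11, 12], "cust_id": [1, 2, 99]})

# Left-merge orders against customers; "left_only" rows have no matching customer
merged = df_orders.merge(df_customers, on="cust_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
```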
DQ Check Using PySpark (Production Ready)
from pyspark.sql.functions import col, sum as spark_sum
# Null check
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()
# Duplicate check
df.groupBy(df.columns).count().filter("count > 1").show()
# Range validation
df.filter((col("Age") < 18) | (col("Age") > 60)).show()
Benefits of Data Quality Checks
- Improves ML accuracy
- Prevents ETL failures
- Enhances business trust
- Detects data drift
- Ensures reliable reporting
Best Practices for DQ Checks
- Run DQ checks before model training
- Automate validations in ETL pipelines
- Maintain DQ logs
- Use tools like Great Expectations
- Validate both source and target data
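The practices above can be combined into a small check runner that returns a result log suitable for storing alongside a pipeline run. This is a minimal illustrative sketch, not a production framework like Great Expectations:

```python
import pandas as pd

def run_dq_checks(df: pd.DataFrame) -> dict:
    """Run basic DQ checks and return a results log."""
    results = {
        "null_counts": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "row_count": len(df),
    }
    # Overall pass/fail: no duplicates and no nulls anywhere
    results["passed"] = (
        results["duplicate_rows"] == 0
        and all(v == 0 for v in results["null_counts"].values())
    )
    return results

df = pd.DataFrame({"id": [1, 2, 2], "val": [10, 30, 30]})
log = run_dq_checks(df)
```

Persisting such logs per pipeline run makes it possible to spot data drift over time and to fail a job early when a check does not pass.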
Conclusion
Data Quality Checks (DQ Checks) are a critical part of modern data engineering and analytics pipelines. Clean data ensures accurate predictions, better business decisions, and stable ML systems.
Bad Data = Bad Decisions
Clean Data = Reliable Insights
About the Author
I am a Data Science Engineer specializing in Machine Learning, Generative AI,
Cloud Computing, Hadoop, Scala, Java, and Python. With expertise in
cutting-edge technologies, I share insights, blogging tips, and
tech tutorials on DeveloperIndian.com, helping developers and data enthusiasts
stay ahead in the industry.