Enterprise AI/ML Checklist
A comprehensive checklist for developing and deploying enterprise-grade AI/ML solutions,
focusing on data management, model development, MLOps, monitoring, and ethical
considerations. It covers essential machine learning practices alongside modern AI/ML
architectures and methodologies.
Model Development
Required
Model Architecture
Implementation Questions:
Why was this specific model architecture chosen for the problem domain?
What alternative architectures were considered and why were they rejected?
How are model complexity and interpretability requirements balanced?
What are the computational requirements and performance characteristics?
How does the architecture handle different input sizes and data types?
What are the model's scalability limitations and bottlenecks?
Key Considerations:
Document architectural decision records (ADRs) with rationale
Consider model inference latency and throughput requirements
Plan for model updates and architecture evolution
Validate architecture against business and technical constraints
Red Flags:
Using complex models without justification for simple problems
No documentation of why specific architecture was chosen
Architecture that doesn't align with deployment constraints
Over-engineering or under-engineering for the use case
Training Pipeline
Implementation Questions:
What steps ensure exact reproducibility of training runs (seed management,
environment isolation)?
How do you handle hyperparameter configuration and experimentation?
What compute resource management and scaling strategies are implemented?
How do you track and log training progress, metrics, and intermediate
results?
What failure recovery and checkpointing mechanisms exist?
How do you handle distributed training across multiple GPUs or nodes?
Key Considerations:
Containerize training environments with fixed dependency versions
Implement automated hyperparameter tracking and experiment logging
Use configuration files to manage training parameters
Implement early stopping and model checkpointing strategies
Red Flags:
Training runs that cannot be reproduced due to missing configuration
Manual training processes prone to human error
No monitoring of training resource usage and costs
Training failures that lose hours or days of computation
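A minimal reproducibility sketch for the considerations above, assuming PyTorch and a version-controlled YAML config; the config keys, file paths, and placeholder model are illustrative rather than a prescribed setup:

```python
import random
import numpy as np
import torch
import yaml

def set_seeds(seed: int) -> None:
    """Pin every relevant RNG so the run can be reproduced exactly."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Hyperparameters live in a version-controlled config file, not in code.
with open("configs/train.yaml") as f:          # illustrative path
    cfg = yaml.safe_load(f)

set_seeds(cfg["seed"])
model = torch.nn.Linear(cfg["n_features"], 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])

for epoch in range(cfg["epochs"]):
    # ... training loop elided ...
    # Checkpoint each epoch so a failure loses at most one epoch of work.
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict(),
         "config": cfg},
        f"checkpoints/epoch_{epoch:03d}.pt",
    )
```

Containerizing a script like this with pinned dependency versions covers the remaining environment-isolation requirement.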
Model Validation
Implementation Questions:
What validation techniques are used (cross-validation, holdout, time-series
splits)?
How do you ensure validation data is representative of production data?
What metrics are tracked beyond accuracy (precision, recall, F1, AUC)?
How do you validate model performance across different data segments?
What statistical tests are used to assess model significance?
How do you validate model robustness and edge case handling?
Key Considerations:
Use stratified sampling to maintain class balance in validation sets
Implement business-relevant metrics alongside technical metrics
Validate model performance across different time periods and conditions
Include bias and fairness assessments in validation procedures
Red Flags:
Data leakage between training and validation sets
Validation only on clean, perfect data that doesn't reflect reality
Over-reliance on single metrics without considering business impact
No validation of model behavior on edge cases or adversarial inputs
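A short validation sketch using scikit-learn, assuming a class-imbalanced tabular problem; the synthetic dataset and model stand in for real ones. Stratified folds preserve class balance, and several metrics are reported beyond accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced dataset standing in for real training data.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Stratified folds keep the class balance in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    RandomForestClassifier(random_state=42),
    X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(f"{metric}: {values.mean():.3f} ± {values.std():.3f}")
```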
Model Versioning
Implementation Questions:
How do you version and track model artifacts (weights, architecture,
hyperparameters)?
What metadata is stored with each model version (performance metrics,
training data, timestamp)?
How do you handle model promotion through development, staging, and
production environments?
What rollback procedures exist when new model versions underperform?
How do you manage model dependencies and environment compatibility?
What tools do you use for model registry and lifecycle management?
Key Considerations:
Use semantic versioning with clear promotion criteria between versions
Store complete model lineage including data versions and code commits
Implement automated model validation before version promotion
Maintain model performance benchmarks across versions
Red Flags:
No systematic approach to model versioning and tracking
Production models that cannot be traced to specific training runs
Model artifacts stored without proper metadata or lineage
No rollback strategy when models degrade in production
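One way to attach lineage to every saved artifact is a metadata file written next to the model. This sketch assumes a simple local directory layout and a scikit-learn-style model, and the field names are illustrative; a model registry such as MLflow can serve the same purpose, as sketched under Model Registry below:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

import joblib

def save_model_version(model, version: str, metrics: dict, data_version: str) -> Path:
    """Persist a model artifact alongside the lineage metadata needed to trace it."""
    out_dir = Path("model_registry") / version   # illustrative local layout
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "model.joblib")

    metadata = {
        "version": version,                       # semantic version, e.g. "1.4.0"
        "created_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,                       # validation metrics for this version
        "data_version": data_version,             # e.g. a dataset tag or hash
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out_dir
```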
MLOps
Required
CI/CD Pipeline
Implementation Questions:
What ML-specific tests are automated (data validation, model testing,
performance regression)?
How do you handle model artifacts in your CI/CD pipeline?
What triggers automated retraining and model updates?
How do you implement canary deployments and A/B testing for models?
What rollback mechanisms exist when model deployments fail?
How do you manage environment consistency across development, staging, and
production?
Key Considerations:
Implement automated data quality checks and schema validation
Include model performance benchmarks in your test suite
Use containerization for consistent ML environment deployment
Implement gradual rollout strategies with performance monitoring
Red Flags:
Manual model deployment processes prone to human error
No automated testing of model performance or data quality
All-or-nothing deployments without gradual rollout capabilities
Environment inconsistencies between development and production
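A sketch of two automated gates a CI pipeline could run before promoting a model, written as pytest tests; the file paths, column names, and AUC floor are assumptions:

```python
# tests/test_model_quality.py -- illustrative CI gates for data and model quality
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.85                                  # assumed minimum acceptable performance
REQUIRED_COLUMNS = {"age", "income", "label"}     # hypothetical schema

def test_holdout_schema():
    """Fail the pipeline if the evaluation data no longer matches the expected schema."""
    df = pd.read_parquet("data/holdout.parquet")  # illustrative path
    assert REQUIRED_COLUMNS.issubset(df.columns)
    assert df["label"].isin([0, 1]).all()

def test_no_performance_regression():
    """Fail the pipeline if the candidate model drops below the agreed AUC floor."""
    df = pd.read_parquet("data/holdout.parquet")
    model = joblib.load("artifacts/candidate_model.joblib")  # illustrative path
    auc = roc_auc_score(
        df["label"], model.predict_proba(df.drop(columns="label"))[:, 1]
    )
    assert auc >= AUC_FLOOR, f"AUC {auc:.3f} below floor {AUC_FLOOR}"
```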
Model Registry
Implementation Questions:
What information is tracked for each model in the registry (version,
performance, metadata)?
How do you manage model promotion through different environments?
What approval workflows exist for model deployment to production?
How do you track which model versions are deployed where?
What lineage information connects models to training data and code?
How do you handle model deprecation and retirement?
Key Considerations:
Implement automated model registration from training pipelines
Store comprehensive model metadata including performance benchmarks
Enable model comparison and diff capabilities
Integrate with deployment tools for consistent model management
Red Flags:
Models deployed without proper registry tracking
No clear promotion criteria between registry stages
Missing model metadata that prevents effective comparison
Registry that becomes disconnected from actual deployments
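A registry sketch assuming MLflow as the backend; the staging API shown varies across MLflow versions, and the model name, run id, and tags are placeholders. The pattern is: register the trained artifact, attach the metadata reviewers need, and promote only after automated validation passes:

```python
import mlflow
from mlflow.tracking import MlflowClient

MODEL_NAME = "churn-classifier"     # hypothetical registered model name
run_id = "abc123"                   # placeholder run id produced by the training pipeline

# Register the artifact logged by a completed training run.
result = mlflow.register_model(f"runs:/{run_id}/model", MODEL_NAME)

client = MlflowClient()
# Attach the metadata reviewers need when deciding on promotion.
client.set_model_version_tag(MODEL_NAME, result.version, "validation_auc", "0.91")
client.set_model_version_tag(MODEL_NAME, result.version, "data_version", "v2024-06")

# Promote only after the automated validation gates have passed.
client.transition_model_version_stage(
    name=MODEL_NAME, version=result.version, stage="Staging"
)
```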
Model Serving
Implementation Questions:
What model serving framework supports your latency and throughput
requirements?
How do you handle model loading, caching, and memory management?
What auto-scaling policies respond to prediction request volume?
How do you implement batch vs. real-time serving strategies?
What monitoring exists for model serving performance and errors?
How do you handle model versioning and A/B testing in serving?
Key Considerations:
Choose serving solutions that match your latency requirements (milliseconds
vs. seconds)
Implement model caching and pre-loading strategies
Use containerized serving with orchestration for scalability
Monitor resource utilization and cost optimization opportunities
Red Flags:
Serving infrastructure that cannot handle peak traffic loads
No monitoring of model serving latency and availability
Models that require manual deployment and scaling
Serving costs that are disproportionate to business value
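A minimal real-time serving sketch using FastAPI: the model is loaded once at startup rather than per request, and request/response schemas are validated. The artifact path, feature layout, and version string are assumptions:

```python
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("artifacts/model.joblib")  # loaded once at startup, not per request

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    score: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
def predict(req: PredictionRequest) -> PredictionResponse:
    # Score a single request; batch serving would accept and vectorize many rows at once.
    score = float(model.predict_proba(np.array([req.features]))[0, 1])
    return PredictionResponse(score=score, model_version="1.4.0")
```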
Monitoring Setup
Implementation Questions:
What model performance metrics are monitored in real-time?
How do you detect and alert on data drift and model degradation?
What infrastructure metrics track serving performance and availability?
How do you correlate model performance with business outcomes?
What dashboards provide visibility into model health?
How do you handle alert fatigue while maintaining responsiveness?
Key Considerations:
Monitor technical metrics (latency, error rates) alongside model-quality metrics
(accuracy, precision) and business outcomes
Implement statistical process control for drift detection
Create tiered alerting based on severity and impact
Provide self-service dashboards for stakeholders
Red Flags:
Model degradation discovered weeks after it begins
No correlation between model metrics and business impact
Alert storms that overwhelm the team
Manual monitoring processes that create delays
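A small drift-detection sketch: compare a live feature window against a training-time reference with a two-sample Kolmogorov-Smirnov test and alert when the distributions diverge. The p-value threshold and feature name are assumptions, and the synthetic arrays stand in for logged production data:

```python
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01   # assumed alerting threshold

def detect_drift(reference: np.ndarray, live: np.ndarray, feature: str) -> bool:
    """Return True (and alert) if the live distribution differs significantly."""
    statistic, p_value = ks_2samp(reference, live)
    drifted = p_value < P_VALUE_THRESHOLD
    if drifted:
        print(f"ALERT: drift on {feature} (KS={statistic:.3f}, p={p_value:.4f})")
    return drifted

# Example usage with synthetic data standing in for logged production features.
rng = np.random.default_rng(0)
detect_drift(rng.normal(0, 1, 5000), rng.normal(0.4, 1, 5000), "transaction_amount")
```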
Model Monitoring
Suggested
Automated Retraining
Implementation Questions:
What conditions trigger automated retraining (performance degradation, data
drift, time intervals)?
How do you ensure retraining uses appropriate data windows and validation
procedures?
What quality gates and approval processes govern automated model deployment?
How do you handle retraining failures and fallback to previous model
versions?
What resource management ensures retraining doesn't interfere with
production workloads?
How do you track and compare the performance of automatically retrained
models?
Key Considerations:
Implement gradual rollout strategies for automatically retrained models
Use comprehensive validation pipelines before deploying retrained models
Monitor business metrics, not just technical metrics, when evaluating
retraining success
Maintain human oversight and intervention capabilities in automated
workflows
Red Flags:
Automated retraining that deploys models without adequate validation
Retraining triggers that fire too frequently, causing model instability
No fallback mechanism when automated retraining produces worse models
Resource competition between retraining and production inference workloads
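A sketch of the trigger logic only (not the retraining pipeline itself), combining a performance floor, a drift signal, a maximum model age, and a minimum interval to avoid retrain storms; every threshold shown is an assumption to be set per use case:

```python
from datetime import datetime, timedelta, timezone

AUC_FLOOR = 0.85                      # assumed minimum acceptable performance
MAX_MODEL_AGE = timedelta(days=30)    # assumed scheduled-refresh horizon
MIN_INTERVAL = timedelta(days=7)      # rate limit to avoid model instability

def should_retrain(current_auc: float, drift_detected: bool,
                   deployed_at: datetime, last_retrain: datetime) -> bool:
    now = datetime.now(timezone.utc)
    if now - last_retrain < MIN_INTERVAL:
        return False                              # too soon since the last retrain
    if current_auc < AUC_FLOOR:
        return True                               # performance degradation
    if drift_detected:
        return True                               # input distribution shift
    return now - deployed_at > MAX_MODEL_AGE      # scheduled refresh

if should_retrain(current_auc=0.82, drift_detected=False,
                  deployed_at=datetime(2024, 5, 1, tzinfo=timezone.utc),
                  last_retrain=datetime(2024, 5, 1, tzinfo=timezone.utc)):
    print("Trigger retraining pipeline (validation gates still apply before deployment)")
```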
Performance Dashboard
Implementation Questions:
What key performance indicators are displayed for different stakeholder
groups (technical teams, business users)?
How do you visualize trends, anomalies, and comparisons across multiple
models and time periods?
What drill-down capabilities allow users to investigate performance issues
or understand model behavior?
How do you ensure dashboards update in real-time and handle high-volume
metrics efficiently?
What customization options allow different teams to focus on relevant
metrics for their roles?
How do you integrate model performance with business outcome metrics in
unified views?
Key Considerations:
Design role-based dashboards that show relevant metrics for each audience
Include both technical metrics (accuracy, latency) and business metrics
(revenue impact, user satisfaction)
Provide contextual information and alerts that help users understand when
action is needed
Enable easy sharing and collaboration features for dashboard insights
Red Flags:
Dashboards that show data without context or actionable insights
Performance visualizations that are too technical for business stakeholders
Dashboards that become outdated or show stale data during critical periods
No clear connection between model performance metrics and business value
delivered
Ethics & Fairness
Required
Bias Detection
Implementation Questions:
What protected attributes and sensitive variables are monitored for bias?
How do you measure and quantify different types of bias (statistical,
individual, group)?
What bias detection tools and techniques are integrated into your pipeline?
How do you balance fairness constraints with model performance?
What is your process for addressing bias when it's detected?
How do you validate bias mitigation effectiveness over time?
Key Considerations:
Implement multiple fairness metrics (demographic parity, equalized odds,
individual fairness)
Use techniques like adversarial debiasing, fairness constraints, or
post-processing
Document bias assessment results and mitigation strategies
Involve diverse stakeholders in defining fairness requirements
Red Flags:
No systematic bias testing across different demographic groups
Relying solely on overall accuracy metrics without fairness assessment
Bias discovered only after negative real-world impact
No clear process for handling fairness-performance trade-offs
Fairness Metrics
Implementation Questions:
Which fairness metrics are most appropriate for your use case (demographic
parity, equalized odds, equal opportunity)?
How do you define and measure fairness across different protected attributes
(race, gender, age, disability)?
What thresholds and acceptable ranges are established for each fairness
metric?
How do you handle intersectionality when multiple protected attributes are
involved?
What trade-offs exist between different fairness metrics and overall model
performance?
How frequently are fairness metrics evaluated and reported to stakeholders?
Key Considerations:
Implement multiple fairness metrics as no single metric captures all aspects
of fairness
Establish fairness thresholds in collaboration with domain experts and
affected communities
Create fairness dashboards and regular reporting mechanisms for stakeholders
Document the rationale behind chosen fairness metrics and thresholds
Red Flags:
Relying on a single fairness metric without considering others
Setting fairness thresholds without input from affected communities or
stakeholders
Fairness metrics that are measured only during development but not in
production
No clear documentation of fairness requirements and trade-offs
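A sketch of two of the group-fairness metrics named above, computed directly from logged predictions; the column names and toy data are illustrative, and the resulting gaps should be compared against thresholds agreed with stakeholders:

```python
import pandas as pd

def demographic_parity_gap(df: pd.DataFrame) -> float:
    """Largest difference in positive-prediction rate between groups."""
    rates = df.groupby("group")["y_pred"].mean()
    return float(rates.max() - rates.min())

def equalized_odds_gap(df: pd.DataFrame) -> float:
    """Largest gap in true-positive or false-positive rate between groups."""
    tpr = df[df.y_true == 1].groupby("group")["y_pred"].mean()
    fpr = df[df.y_true == 0].groupby("group")["y_pred"].mean()
    return float(max(tpr.max() - tpr.min(), fpr.max() - fpr.min()))

# Toy prediction log; in production this would come from monitored traffic.
preds = pd.DataFrame({
    "group":  ["a", "a", "a", "b", "b", "b"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 0, 1],
})
# Compare these gaps against thresholds set with domain experts and stakeholders.
print("demographic parity gap:", demographic_parity_gap(preds))
print("equalized odds gap:", equalized_odds_gap(preds))
```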
Model Explainability
Implementation Questions:
What explainability techniques are appropriate for your model type (LIME,
SHAP, attention maps, feature importance)?
How do you provide explanations at different levels (global, local,
counterfactual)?
What is the target audience for explanations (data scientists, business
users, end users, regulators)?
How do you validate that explanations are accurate and meaningful?
What processes ensure explanations remain consistent as models are updated?
How do you balance explainability requirements with model performance and
complexity?
Key Considerations:
Implement multiple explanation techniques to provide comprehensive
interpretability
Tailor explanation formats and complexity to different stakeholder needs
Validate explanations against domain expert knowledge and intuition
Integrate explainability tools into model development and deployment
workflows
Red Flags:
Using black-box models in high-stakes decisions without adequate
explainability
Explanations that are technically correct but incomprehensible to intended
users
No validation of explanation accuracy or faithfulness to actual model
behavior
Explanations that become outdated when models are retrained or updated
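A minimal global-explanation sketch using scikit-learn's permutation importance (one of the feature-importance techniques mentioned above); libraries such as SHAP or LIME provide richer local, per-prediction explanations. The public dataset and model are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature hurt held-out performance?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.4f}")
```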
Ethics Guidelines
Implementation Questions:
What ethical principles and values guide your AI development (transparency,
accountability, human oversight)?
How do you translate high-level ethical principles into specific, actionable
guidelines for teams?
What review processes ensure AI projects adhere to ethical guidelines
throughout development?
How do you handle ethical dilemmas and conflicts between different ethical
principles?
What training and education programs ensure team members understand and
apply ethical guidelines?
How do you update and evolve ethical guidelines as AI technology and
societal norms change?
Key Considerations:
Develop clear, specific guidelines that address common ethical issues in AI
development
Implement mandatory ethics reviews at key project milestones
Provide practical tools and checklists to help teams apply ethical
principles
Create escalation procedures for complex ethical decisions and conflicts
Red Flags:
Vague ethical statements without specific implementation guidance
No formal process for reviewing ethical implications of AI projects
Ethics considered only at the end of development rather than throughout
No mechanisms for updating guidelines as ethical understanding evolves
Documentation
Required
Model Cards
Implementation Questions:
What model performance metrics, limitations, and intended use cases are
documented in each model card?
How do you document training data characteristics, biases, and
representativeness?
What evaluation datasets and fairness assessments are included?
How do you document model architecture decisions and hyperparameter choices?
What ethical considerations and potential risks are outlined?
How often are model cards updated as models evolve or new issues are
discovered?
Key Considerations:
Follow standardized model card templates (Google Model Cards, Hugging Face Model
Cards)
Include quantitative performance metrics across different demographic groups
Document known limitations, failure modes, and out-of-scope use cases
Make model cards accessible to both technical and non-technical stakeholders
Red Flags:
Model cards that are created once and never updated
Missing documentation of model limitations or potential biases
Technical jargon that makes cards inaccessible to business stakeholders
No process for reviewing and approving model card content
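A lightweight, version-controllable model card sketch as a Python dataclass, loosely following the fields of the standard templates; every value shown is a placeholder:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelCard:
    name: str
    version: str
    intended_use: str
    out_of_scope_uses: list[str]
    training_data: str
    evaluation_metrics: dict[str, float]
    metrics_by_group: dict[str, dict[str, float]]
    known_limitations: list[str]
    ethical_considerations: list[str] = field(default_factory=list)

card = ModelCard(
    name="churn-classifier",
    version="1.4.0",
    intended_use="Rank existing customers by churn risk for retention outreach.",
    out_of_scope_uses=["credit or lending decisions"],
    training_data="2022-2024 CRM snapshots; see data documentation for sampling notes.",
    evaluation_metrics={"auc": 0.91, "precision": 0.72, "recall": 0.65},
    metrics_by_group={"region_eu": {"auc": 0.90}, "region_us": {"auc": 0.92}},
    known_limitations=["Underperforms on accounts younger than 30 days."],
)
print(json.dumps(asdict(card), indent=2))  # render or publish alongside the model version
```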
Data Documentation
Implementation Questions:
What metadata is captured for each data source (origin, freshness, quality,
ownership)?
How do you document data schema evolution and backward compatibility?
What transformations, feature engineering, and preprocessing steps are
documented?
How do you track data lineage from raw sources to final features?
What data quality issues and remediation steps are documented?
How do you document data sampling strategies and potential biases?
Key Considerations:
Create data dictionaries with business-friendly descriptions
Document data collection methodologies and potential sampling biases
Maintain version-controlled schema documentation with change logs
Include data quality metrics and acceptable thresholds
Red Flags:
Undocumented data transformations that affect model reproducibility
Missing documentation of data collection biases or limitations
Outdated schema documentation that doesn't reflect current data structure
No clear ownership or contact information for data sources
Pipeline Documentation
Implementation Questions:
What pipeline architecture diagrams show data flow and component
interactions?
How do you document pipeline dependencies, configurations, and environment
requirements?
What deployment procedures, rollback processes, and monitoring setups are
documented?
How do you document pipeline failure modes and recovery procedures?
What performance characteristics and resource requirements are specified?
How do you maintain runbooks for pipeline operations and troubleshooting?
Key Considerations:
Create visual pipeline diagrams with clear data flow and decision points
Document all configuration parameters and environment variables
Maintain step-by-step deployment and rollback procedures
Include performance benchmarks and resource utilization guidelines
Red Flags:
Undocumented pipeline components that create knowledge silos
Missing documentation of failure recovery procedures
Outdated deployment documentation that doesn't reflect current processes
No documentation of pipeline performance characteristics or SLAs
API Documentation
Implementation Questions:
What API endpoints are documented with request/response schemas and
examples?
How do you document authentication, authorization, and rate limiting
requirements?
What error codes, status messages, and troubleshooting guidance are
provided?
How do you document API versioning, deprecation policies, and migration
paths?
What performance characteristics, latency expectations, and SLAs are
specified?
How do you provide interactive documentation and code examples for different
languages?
Key Considerations:
Use OpenAPI/Swagger specifications for standardized, interactive
documentation
Provide realistic examples with actual data formats and edge cases
Document all possible error scenarios with appropriate response codes
Include SDK examples and client libraries for popular programming languages
Red Flags:
API documentation that becomes stale and doesn't reflect actual endpoints
Missing documentation of error handling and edge cases
No examples or code snippets for common integration patterns
Unclear authentication requirements or API usage policies
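A sketch of schema-driven documentation with FastAPI, which generates an interactive OpenAPI page from the declared request/response models; the endpoint, fields, and placeholder logic are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="Churn Scoring API", version="1.4.0")

class ScoreRequest(BaseModel):
    customer_id: str = Field(description="Internal customer identifier")
    features: list[float] = Field(description="Feature vector in the documented order")

class ScoreResponse(BaseModel):
    score: float = Field(description="Churn probability between 0 and 1")
    model_version: str = Field(description="Registry version that produced the score")

@app.post(
    "/v1/score",
    response_model=ScoreResponse,
    summary="Score a single customer",
    responses={422: {"description": "Request failed schema validation"}},
)
def score(req: ScoreRequest) -> ScoreResponse:
    return ScoreResponse(score=0.42, model_version="1.4.0")  # placeholder logic
```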
Suggested
Decision Records
Implementation Questions:
What architectural decisions are captured in decision records (model
selection, infrastructure choices, design patterns)?
How do you document the context, alternatives considered, and rationale for
each decision?
What is the process for creating, reviewing, and approving decision records?
How do you track the consequences and outcomes of architectural decisions
over time?
What template or format ensures consistent decision record quality?
How do you make decision records discoverable and searchable for future
reference?
Key Considerations:
Use standardized ADR templates with status, context, decision, and
consequences
Document both technical and business rationale for decisions
Review decisions periodically to assess outcomes and learn from results
Make decision records part of the code repository for version control
Red Flags:
Major architectural decisions made without documentation or rationale
Decision records that are never revisited or updated with outcomes
No consistent format or process for creating decision records
Important decisions buried in meeting notes or email threads
Experiment Tracking
Implementation Questions:
What experiment metadata is tracked (hypothesis, parameters, datasets,
results)?
How do you document experiment setup, configuration, and reproducibility
requirements?
What tools are used for experiment logging and result comparison (MLflow,
Weights & Biases)?
How do you track negative results and failed experiments for future
reference?
What process exists for peer review and validation of experiment results?
How do you organize and search experiments across different projects and
teams?
Key Considerations:
Implement automated experiment logging with version control integration
Document clear hypotheses and success criteria before running experiments
Track resource usage and computational costs for experiment planning
Create experiment summaries that highlight key insights and learnings
Red Flags:
Experiments that cannot be reproduced due to missing documentation
No systematic tracking of negative results or failed approaches
Experiment logs that are scattered across different tools and platforms
Missing documentation of experiment rationale and business context
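An experiment-logging sketch assuming MLflow (comparable tools such as Weights & Biases follow the same pattern); the experiment name, parameters, metrics, and artifact path are illustrative:

```python
import mlflow

mlflow.set_experiment("churn-model-experiments")

params = {"model": "random_forest", "n_estimators": 300, "max_depth": 12}

with mlflow.start_run(run_name="baseline-vs-deeper-trees"):
    # Record the hypothesis and setup so the run is interpretable later.
    mlflow.set_tag("hypothesis", "Deeper trees improve recall without hurting AUC")
    mlflow.log_params(params)
    mlflow.log_param("data_version", "v2024-06")

    # ... training and evaluation elided ...
    mlflow.log_metrics({"auc": 0.91, "recall": 0.65})
    mlflow.log_artifact("reports/validation_report.html")  # illustrative artifact path
```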