Troubleshooting Real-World Scenarios
Scenario 1: Application Performance Degradation

Problem: Users report that a web application has become significantly slower.

Solution Steps:
- Monitor System Resources: Use tools like `top`, `htop`, `vmstat`, and `iotop` to check CPU, memory, and I/O usage.
- Check Application Logs: Look for error messages or warnings in the application's logs (`/var/log/appname`).
- Database Performance: If the application relies on a database, check for slow queries and optimize them.
- Network Latency: Use `ping` and `traceroute` to check for network delays between the server and its clients or resources.
- Update and Optimize: Ensure the application and all dependencies are up to date. Consider implementing caching or other performance improvements.
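Much of the resource-monitoring step can be done without extra tooling by reading procfs directly. A minimal Linux-only sketch (the `/proc` paths are standard; nothing beyond a POSIX shell is assumed):

```shell
#!/bin/sh
# Quick first-pass snapshot from /proc before reaching for top/htop/vmstat:
# load averages plus the memory figures that usually matter (Linux-only).
echo "== load average (1/5/15 min) =="
cut -d ' ' -f 1-3 /proc/loadavg
echo "== memory (kB) =="
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo
```

If the load average is high but CPU usage is low, suspect I/O wait and move on to `iotop`.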
Scenario 2: System Unreachable via SSH

Problem: A remote server is not accessible via SSH.

Solution Steps:
- Check Network Connectivity: Use `ping` to ensure the server is reachable over the network.
- Verify SSH Service: Access the server through an alternate method (e.g., console access) to check if the SSH service is running (`systemctl status sshd`).
- Firewall Rules: Ensure no firewall rules are blocking SSH access. Check with `iptables -L` or `ufw status`.
- SSH Configuration: Verify the SSH configuration file (`/etc/ssh/sshd_config`) for any incorrect settings that may prevent access.
- Log Analysis: Review `/var/log/auth.log` for SSH connection attempts and possible reasons for failure.
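For the log-analysis step, a small helper can pull the interesting sshd lines out of the auth log. A hedged sketch — the path is the Debian/Ubuntu default (RHEL-family systems log to `/var/log/secure`), and the patterns shown are common OpenSSH messages, not an exhaustive list:

```shell
#!/bin/sh
# Sketch: surface failed SSH login attempts from an auth log.
ssh_failures() {
  # $1: path to the auth log
  grep -E 'sshd.*(Failed password|Invalid user|Connection closed by authenticating user)' \
    "${1:?usage: ssh_failures <logfile>}" | tail -20
}
# typical usage on Debian/Ubuntu; adjust the path for your distribution
[ -f /var/log/auth.log ] && ssh_failures /var/log/auth.log || true
```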
Scenario 3: Disk Space Suddenly Full

Problem: A server's disk space usage spikes unexpectedly, impacting services.

Solution Steps:
- Identify Large Files/Directories: Use `du -sh *` in various directories to find large files. Tools like `ncdu` can help visualize disk usage.
- Log Files: Check for unusually large log files in `/var/log` and consider implementing log rotation with `logrotate`.
- Temporary Files: Clear out `/tmp` and other directories of temporary files that may no longer be needed.
- Audit Deleted Files: Check for deleted files still in use by processes with `lsof | grep deleted` and restart the associated services to free up space.
- Backup and Cleanup: Ensure unnecessary backups or old data are not consuming space. Implement a cleanup strategy.
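The first two steps collapse into a one-liner that ranks the largest entries under a directory so the space hog stands out. A sketch assuming GNU coreutils (`sort -h` understands the human-readable sizes that `du -sh` prints); `ncdu` gives the same view interactively:

```shell
#!/bin/sh
# Sketch: list the ten largest entries under a directory, biggest first.
largest() {
  du -sh "${1:?usage: largest <dir>}"/* 2>/dev/null | sort -rh | head -10
}
# /var is a common first place to look; repeat on whichever entry tops the list
largest /var
```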
Scenario 4: Service Fails to Start After Update

Problem: After updating a package, its associated service fails to start.

Solution Steps:
- Check Service Status: Use `systemctl status service_name` to get error messages.
- Review Logs: Look at the service's log files and `journalctl -xe` for details.
- Dependency Issues: Ensure all dependencies are correctly installed and not broken by the update.
- Configuration Compatibility: Check if the update requires changes in the service's configuration files.
- Rollback: If immediate resolution is not possible, consider rolling back the update and reporting the issue to the package maintainers.
Scenario 5: High CPU Load

Problem: The system reports a high CPU load, affecting performance.

Solution Steps:
- Identify Process: Use `top` or `htop` to identify processes consuming excessive CPU resources.
- Analyze Process Activity: Determine why the process is using high CPU (e.g., infinite loop in code, unexpected high traffic).
- Optimize/Update: Optimize the application or script if possible, or check for updates that may address the issue.
- Resource Limits: Implement CPU usage limits for processes using `cpulimit` or control groups (cgroups).
- Review Scheduling: If the load is due to cron jobs or scheduled tasks, spread them out or optimize their execution times.
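When `top` and `htop` are unavailable (minimal containers, rescue shells), the same ranking can be derived from procfs. A Linux-only sketch: fields 14 and 15 of `/proc/PID/stat` hold utime and stime in clock ticks, and the process name must be stripped first because it may contain spaces:

```shell
#!/bin/sh
# Sketch: rank processes by accumulated CPU time, reading /proc directly.
for d in /proc/[0-9]*; do
  stat=$(cat "$d/stat" 2>/dev/null) || continue   # process may have exited
  comm=${stat#*'('}; comm=${comm%')'*}            # name between the parens
  rest=${stat##*') '}                             # fields after the name
  ticks=$(echo "$rest" | awk '{print $12 + $13}') # utime + stime
  echo "$ticks ${d#/proc/} $comm"
done | sort -rn | head -5
```

This shows lifetime CPU time, not the current rate; sample it twice and diff the tick counts to see who is burning CPU right now.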
Scenario 6: Database Connectivity Issues

Problem: An application is unable to connect to its database, resulting in service downtime.

Solution Steps:
- Verify Database Service Status: Ensure the database service is running (`systemctl status mysql` or `systemctl status postgresql`).
- Check Network Issues: Confirm network connectivity between the application server and the database host.
- Review Database Logs: Look for connection errors or warnings in the database logs (`/var/log/mysql/error.log` or `/var/log/postgresql/`).
- Test Connection Manually: Use database client tools (`mysql` or `psql`) to test connectivity using the credentials configured for the application.
- Configuration Review: Ensure the database configuration (`my.cnf` or `postgresql.conf`) allows connections from the application server and that the application's configuration files contain the correct database credentials.
Scenario 7: Email Service Not Sending Emails

Problem: A server's email service (Postfix, Sendmail) is not sending emails, affecting notifications and alerts.

Solution Steps:
- Service Status: Check the status of the email service (`systemctl status postfix` or `systemctl status sendmail`).
- Mail Queue: Inspect the mail queue (`mailq`) for stuck emails and error messages that can indicate the cause.
- Log Analysis: Review the mail logs (`/var/log/mail.log` or `/var/log/maillog`) for errors such as connection timeouts or authentication failures.
- Configuration Check: Verify the email service configuration for correct SMTP settings, relay hosts, and authentication details.
- External Blocklist Check: Ensure your server's IP address is not on any DNS-based blocklists (RBLs) if sending to external recipients.
Scenario 8: SSL/TLS Certificate Errors

Problem: Users report SSL/TLS certificate errors when accessing a web service, indicating potential security warnings.

Solution Steps:
- Certificate Validation: Check the certificate's validity dates (`openssl x509 -in cert.pem -noout -text`).
- Chain of Trust: Verify the entire certificate chain is correctly installed and there are no missing intermediate certificates.
- Server Configuration: Review the web server's SSL/TLS configuration to ensure it's pointing to the correct certificate files.
- Browser Compatibility: Check for compatibility issues with older browsers or clients, especially if using newer encryption algorithms.
- Renewal Errors: If using Let's Encrypt or another ACME-based CA, check for errors in the renewal process (cron jobs or systemd timers).
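The validation step can be scripted with `openssl`. A sketch — the throwaway self-signed certificate is generated only so the example is self-contained; point `inspect_cert` at your deployed certificate instead:

```shell
#!/bin/sh
# Sketch: print the certificate fields that matter for expiry/trust errors.
inspect_cert() {
  openssl x509 -in "${1:?usage: inspect_cert <pem>}" -noout -subject -issuer -dates
}
# demo only: create a self-signed cert so the sketch runs standalone
demo=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -subj "/CN=example.test" -keyout "$demo/key.pem" -out "$demo/cert.pem" 2>/dev/null
inspect_cert "$demo/cert.pem"
```

A mismatched issuer/subject pair between the leaf and the next certificate in the chain is the usual sign of a missing intermediate.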
Scenario 9: Filesystem Corruption Detected

Problem: The system reports errors indicating possible filesystem corruption on one of the disks.

Solution Steps:
- Immediate Backup: If a current backup does not exist, back up critical data immediately to prevent data loss.
- Filesystem Check: Unmount the affected filesystem (if not the root filesystem) and use `fsck` to check and repair filesystem errors.
- Hardware Diagnostics: Run SMART diagnostics (`smartctl -t long /dev/sdX`) to check for underlying hardware issues with the disk.
- Review System Logs: Check `/var/log/syslog` or `dmesg` for I/O errors or hardware issues leading to filesystem corruption.
- Replace Hardware: If hardware faults are detected, plan for a disk replacement and restore data from backups.
Scenario 10: Unexpected System Reboots

Problem: The server is experiencing unexpected reboots, leading to service instability.

Solution Steps:
- Check Logs: Review `/var/log/syslog`, `/var/log/messages`, and `/var/log/kern.log` for entries just before the reboot, looking for kernel panics or hardware issues.
- Hardware Tests: Run comprehensive hardware diagnostics to check for overheating, RAM faults, or power supply issues.
- External Monitoring: Set up external monitoring to capture the exact time of reboots and correlate this with internal logs and possible external factors (e.g., power outages).
- Update System: Ensure the system and all drivers are up to date to rule out software bugs causing the reboots.
- Isolate Changes: Recall any recent changes to the system (hardware additions, software updates) that could be related to the issue.
Scenario 11: Network Performance Degradation

Problem: Critical services are experiencing intermittent network latency and packet loss, impacting user experience.

Solution Steps:
- Baseline Comparison: Compare current network performance metrics against historical baselines to identify specific patterns or anomalies.
- Advanced Monitoring Tools: Utilize advanced network monitoring tools (e.g., `iperf`, `nmap`, Wireshark) to analyze traffic flow and pinpoint congestion points or packet drops.
- Quality of Service (QoS) Configuration: Review and adjust QoS settings on network devices to prioritize critical application traffic.
- Network Topology Review: Examine the network topology for any recent changes that might contribute to the issues, such as loops without proper spanning tree protocol configurations.
- ISP and External Factors: Coordinate with the ISP and external partners to rule out external causes. Use `traceroute` or `mtr` to identify network latency at hop-level granularity.
Scenario 12: Cluster Service Failover Not Working

Problem: In a high-availability cluster setup, services are not failing over smoothly between nodes during planned maintenance or unexpected outages.

Solution Steps:
- Cluster Configuration Review: Thoroughly review cluster configuration files for any misconfigurations or inconsistencies between nodes.
- Logs and Cluster Reports: Analyze cluster logs (`/var/log/pacemaker.log`, `/var/log/corosync/corosync.log`) and use cluster reporting tools (`crm_report`) to gather detailed insights about the failover issues.
- Resource Constraints: Check for resource constraints or dependencies that could prevent services from starting on failover nodes, including IP address conflicts, storage access issues, or incorrect service dependencies.
- Simulate Failovers: Safely simulate failover scenarios to observe behaviors and identify conditions not met for a successful failover.
- Update and Patch: Ensure all cluster software components are up to date with the latest patches that might resolve known failover issues.
Scenario 13: Distributed Database Synchronization Problems

Problem: A distributed database system (e.g., Galera Cluster, MongoDB replica set) is experiencing synchronization delays or conflicts, leading to data inconsistencies.

Solution Steps:
- Synchronization Metrics: Monitor synchronization metrics specific to the database system to identify lag or failed synchronization attempts.
- Network Latency: Check for network latency or interruptions between database nodes that could affect synchronization.
- Database Logs: Examine database logs for errors related to replication or synchronization, looking for patterns or specific error messages.
- Configuration Validation: Validate the database configuration for replication settings, ensuring they are optimized for your network and data size.
- Conflict Resolution Policies: Review and adjust conflict resolution policies or mechanisms to handle data inconsistencies more effectively.
Scenario 14: Kernel Panic on Production Server

Problem: A production server experiences a kernel panic, causing an unexpected reboot and service downtime.

Solution Steps:
- Crash Dump Analysis: Configure and use `kdump` or a similar tool to capture and analyze crash dumps, identifying the root cause of the kernel panic.
- System and Application Logs: Check `/var/log/messages`, `/var/log/syslog`, and application logs for any anomalies or errors preceding the panic.
- Hardware Diagnostics: Perform thorough hardware diagnostics to rule out issues such as faulty memory, overheating CPUs, or disk failures.
- Kernel and Drivers: Ensure the kernel and all drivers are up to date. Investigate any recently installed kernel modules or drivers that could be causing conflicts.
- System Changes Review: Audit recent changes to the system, including updates, configuration changes, or new software installations, that might have introduced instability.
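Capturing a crash dump presupposes that memory was reserved for the capture kernel at boot. A hedged, illustrative fragment for a GRUB-based distribution — the 256M reservation and the exact regeneration commands vary by distribution and RAM size, so consult your distro's kdump documentation:

```shell
# /etc/default/grub (fragment) -- reserve RAM for the kdump capture kernel.
# 256M is an illustrative size; small systems need less, large ones more.
GRUB_CMDLINE_LINUX="... crashkernel=256M"

# After editing, regenerate the bootloader config and enable the service:
#   update-grub                                # Debian/Ubuntu
#   grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL-family
#   systemctl enable --now kdump
```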
Scenario 15: Secure Shell (SSH) Key Authentication Failing

Problem: SSH key-based authentication is failing for a remote access setup, requiring fallback to password authentication.

Solution Steps:
- Permissions Check: Ensure correct permissions on the `.ssh` directory (`700`) and the `authorized_keys` file (`600`) on the server side.
- SSH Configuration: Review the SSH daemon configuration (`/etc/ssh/sshd_config`) for directives like `PubkeyAuthentication` and `AuthorizedKeysFile`, and make sure they are correctly set.
- Client Configuration and Key Format: Verify the SSH client configuration and ensure the key type is supported by the server; recent OpenSSH releases have deprecated older types such as DSA and SHA-1-signed RSA.
- Verbose SSH Output: Use `ssh -vvv` for verbose output to get more detailed error messages during the authentication process.
- System and Security Logs: Examine `/var/log/auth.log` or `/var/log/secure` for detailed error messages related to the failed key authentication attempts.
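The permissions check is mechanical enough to script. A sketch using GNU `stat` (BSD `stat` takes `-f` instead of `-c`); the demo call targets a scratch directory so nothing in a real `$HOME/.ssh` is touched:

```shell
#!/bin/sh
# Sketch: enforce the permissions sshd insists on before it will trust keys.
fix_ssh_perms() {
  dir="${1:-$HOME/.ssh}"
  mkdir -p "$dir"
  touch "$dir/authorized_keys"
  chmod 700 "$dir"
  chmod 600 "$dir/authorized_keys"
  stat -c '%a %n' "$dir" "$dir/authorized_keys"   # show the result
}
# demo against a scratch directory; in practice run fix_ssh_perms with no args
fix_ssh_perms "$(mktemp -d)/ssh-demo"
```

Note that sshd also rejects keys when the home directory itself is group- or world-writable; `StrictModes` in `sshd_config` controls this check.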
Scenario 16: Sudden Increase in Load on a Web Application

Problem: A production web application experiences a sudden increase in load, leading to slow response times and timeouts.

Solution Steps:
- Performance Monitoring: Use tools like `top`, `htop`, `vmstat`, and `iotop`, along with web server and application performance monitoring (APM) tools, to identify bottlenecks.
- Log Analysis: Review web server logs (`access.log` and `error.log`) and application logs for any unusual patterns or errors that coincide with the increase in load.
- Scaling: If the infrastructure supports it, temporarily scale up resources (CPU, memory) or scale out by adding more instances behind a load balancer.
- Caching and Optimization: Implement or optimize caching strategies to reduce the load on application servers and databases.
- Traffic Analysis: Use tools like `nload` or `iftop` to analyze incoming traffic for patterns that might indicate a DDoS attack or web scraping, and apply rate limiting or IP blocking as necessary.
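For the log-analysis and traffic steps, counting requests per client IP is often the fastest way to spot a single abusive source. A sketch assuming a combined-format log where the client IP is the first field; the nginx path in the demo is just the common default:

```shell
#!/bin/sh
# Sketch: rank client IPs by request count in an access log.
top_clients() {
  awk '{print $1}' "${1:?usage: top_clients <access.log>}" \
    | sort | uniq -c | sort -rn | head -10
}
# typical usage; adjust the path for your web server
[ -f /var/log/nginx/access.log ] && top_clients /var/log/nginx/access.log || true
```

One IP dominating the top of the list is a candidate for rate limiting; many IPs with identical request patterns points toward a distributed scraper or DDoS.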
Scenario 17: Database Lock Contention

Problem: Reports of database operations timing out or slowing down significantly, potentially due to lock contention.

Solution Steps:
- Identify Locks: Use database-specific tools or commands to identify current locks, lock wait times, and transactions causing the locks.
- Query Optimization: Analyze and optimize slow-running queries that contribute to lock contention.
- Transaction Management: Review application code for transaction scopes to ensure transactions are as short as possible, reducing lock times.
- Configuration Tuning: Adjust database configuration settings related to locking mechanisms and concurrency to improve performance.
- Hardware Consideration: If contention is due to I/O bottlenecks, consider hardware upgrades or adjustments, such as faster disks or additional memory for caching.
Scenario 18: SSL Certificate Renewal Failure

Problem: Automated SSL certificate renewal for a web service fails, risking service downtime due to an expired certificate.

Solution Steps:
- Manual Renewal Attempt: Manually trigger the renewal process to observe any errors or issues that are not evident from logs.
- Log Inspection: Check the renewal process logs for detailed error messages or hints about the failure cause.
- Permissions and Ownership: Verify that file permissions and ownership allow the renewal process to access and modify necessary files.
- Firewall and Network Configuration: Ensure that firewall rules or network configurations do not block access to the Certificate Authority (CA) or challenge response paths.
- Dependency and Tool Updates: Ensure that all dependencies and tools involved in the renewal process are up to date, as outdated software can cause compatibility issues.
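A renewal failure is most dangerous when nobody notices before expiry, so an independent expiry check makes a useful safety net alongside the renewal job itself. A sketch using `openssl x509 -checkend`; the cert path, the 14-day window, and the demo certificate are illustrative values:

```shell
#!/bin/sh
# Sketch: warn when a certificate is within N days of expiry.
check_expiry() {
  cert="${1:?usage: check_expiry <pem> [days]}"
  days="${2:-14}"
  # -checkend exits 0 if the cert is still valid that many seconds from now
  if openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null; then
    echo "OK: $cert valid for more than $days days"
  else
    echo "WARNING: $cert expires within $days days"
  fi
}
# demo only: a self-signed cert valid for 30 days
demo=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 30 -subj "/CN=renewal.test" \
  -keyout "$demo/key.pem" -out "$demo/cert.pem" 2>/dev/null
check_expiry "$demo/cert.pem" 14
check_expiry "$demo/cert.pem" 60
```

Wired into cron or a systemd timer, the WARNING branch can page someone well before the expiry date.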
Scenario 19: Inter-service Communication Failure in Microservices Architecture

Problem: Services in a microservices architecture are intermittently unable to communicate, leading to failed requests and degraded performance.

Solution Steps:
- Service Discovery Health Check: Ensure that the service discovery mechanism is functioning correctly and that all services are correctly registered.
- Network Troubleshooting: Use `traceroute`, `ping`, and network capture tools like `tcpdump` to diagnose network connectivity issues between services.
- Inspect Service Logs: Look for errors in service logs that indicate communication failures, such as timeouts, connection refused, or DNS resolution failures.
- API Gateway and Load Balancer Configuration: Check configurations for any changes or issues that could affect routing and load balancing between services.
- Circuit Breaker and Retry Logic: Implement or review existing circuit breaker patterns and retry logic to gracefully handle communication failures and prevent cascading failures.
Scenario 20: Memory Leak Leading to System Instability

Problem: A critical application is suspected of having a memory leak, causing gradual degradation in system performance and stability.

Solution Steps:
- Memory Usage Monitoring: Use tools like `top`, `htop`, or `valgrind` to monitor memory usage over time and identify leaking processes.
- Application Profiling: Utilize application profiling tools specific to the application's programming language to pinpoint the source of the leak.
- Review Recent Changes: Analyze recent code changes that could have introduced the memory leak.
- Resource Limit Enforcement: Apply resource limits using cgroups or the application's configuration to contain the leak's impact until a fix can be deployed.
- Leak Patching and Testing: Once identified, patch the memory leak, thoroughly test the fix, and deploy it during a maintenance window to minimize impact.
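For the monitoring step, sampling a process's resident set size over time is often enough to confirm a leak before breaking out `valgrind`. A Linux-only sketch reading `VmRSS` from procfs; the defaults sample the shell itself purely so the example is self-contained:

```shell
#!/bin/sh
# Sketch: sample a process's RSS at intervals; under constant workload,
# a VmRSS that only ever climbs is the classic leak signature.
sample_rss() {
  pid="${1:-$$}"; samples="${2:-5}"; interval="${3:-2}"
  i=0
  while [ "$i" -lt "$samples" ]; do
    printf '%s ' "$(date +%H:%M:%S)"
    grep '^VmRSS' "/proc/$pid/status" || { echo "process $pid exited"; return 1; }
    i=$((i + 1))
    if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
  done
}
# demo: three samples of this shell, one second apart
sample_rss "$$" 3 1
```

For a real investigation, point it at the suspect PID with a longer interval and plot the numbers.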
Scenario 21: Intermittent Kernel OOPS in a Production Server

Problem: A critical production server sporadically crashes with a kernel OOPS message, affecting service availability.

Solution Steps:
- Analyze Crash Dumps: Collect and analyze kernel crash dumps using `kdump` and the `crash` utility to identify the cause of the kernel OOPS.
- Review System Logs: Examine `/var/log/messages` and `dmesg` for patterns or messages preceding the crashes.
- Hardware Diagnostics: Run comprehensive diagnostics to check for hardware issues, such as faulty memory (using `memtest86+`), CPU, or disk errors.
- Kernel Updates: Check if the current kernel version has known bugs related to the OOPS message; upgrade to a stable kernel version if available.
- Third-party Drivers and Modules: Identify and update or remove third-party kernel modules and drivers that might be causing instability.
Scenario 22: Distributed File System Performance Bottleneck

Problem: A distributed file system used by multiple applications exhibits severe performance degradation under load.

Solution Steps:
- Network Throughput and Latency: Test network throughput and latency between nodes using tools like `iperf3` and `ping`. High latency or low throughput could indicate network issues.
- Disk I/O Bottleneck: Utilize `iostat` and `iotop` to identify disk I/O bottlenecks on nodes. SSDs may be required for high I/O workloads.
- File System Health Check: Perform file system checks and optimizations specific to the distributed file system in use (e.g., rebalancing data in HDFS).
- Tuning Parameters: Adjust file system and network kernel parameters to optimize for the specific workload and deployment architecture.
- Workload Analysis: Analyze access patterns and workload types. Implement caching or data locality optimizations to reduce cross-network file accesses.
Scenario 23: High Availability Cluster Split-Brain Issue

Problem: A high availability (HA) cluster experiences a split-brain condition, causing data inconsistency and service disruption.

Solution Steps:
- Cluster State Examination: Use cluster management tools (e.g., `pcs status`, `crm_mon`) to assess the current state and identify the split-brain condition.
- Network Diagnostics: Check the inter-cluster communication links for failures or misconfigurations that may have led to the split-brain.
- Fencing and Quorum Configuration: Review and correct fencing (STONITH) configurations and quorum settings to prevent future split-brain scenarios.
- Synchronize Data: Manually synchronize data between cluster nodes to resolve inconsistencies, following the cluster's recommended practices.
- Update and Test Cluster Configuration: Ensure all cluster software is up to date and test cluster failover mechanisms to verify that split-brain conditions are correctly handled.
Scenario 24: SSL Handshake Failures on Secure Websites

Problem: Users report SSL handshake failures when accessing company websites, leading to trust issues and blocked access.

Solution Steps:
- Certificate and Chain Verification: Verify the SSL certificate chain for completeness and validity using `openssl s_client -connect hostname:port`.
- Cipher Suite Compatibility: Ensure the server's SSL configuration includes cipher suites compatible with client browsers, especially older versions.
- Protocol Version Support: Check that the server supports TLS protocol versions that are widely used by clients, considering both security and compatibility.
- Server Configuration Optimization: Use tools like `sslscan` and online services (e.g., SSL Labs) to test and optimize the server's SSL/TLS configuration.
- Review Logs for Errors: Examine web server and application logs for specific SSL handshake error messages that can pinpoint the issue (e.g., expired certificates, required client certificates).
Scenario 25: Inconsistent System Time Causing Authentication Failures

Problem: Servers in a distributed environment experience authentication failures, suspected to be caused by time synchronization issues.

Solution Steps:
- NTP/Chrony Service Status: Check the status and configuration of NTP or Chrony services on all affected servers to ensure they're synchronized to the same time source.
- Time Offset Analysis: Use `ntpq -p` or `chronyc sources` to analyze time sources and offsets from the synchronized time source.
- Hardware Clock Synchronization: Ensure the system's hardware clock (RTC) is synchronized with the system time to maintain time across reboots.
- Time Zone Consistency: Verify that all servers are configured with the correct time zone and that there are no discrepancies causing the authentication issues.
- Kerberos Configuration Review: For Kerberos-based authentication systems, ensure clock skew (`clockskew` in `krb5.conf`) is appropriately configured to tolerate minor time differences.
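The Kerberos knob mentioned above lives in `krb5.conf`. An illustrative fragment — 300 seconds is the commonly cited default tolerance, shown here as an example value rather than a recommendation:

```ini
# /etc/krb5.conf (fragment)
[libdefaults]
    # maximum tolerated clock difference between client and KDC, in seconds
    clockskew = 300
```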
Scenario 26: Security Breach via Compromised Service Account

Problem: Anomalies detected in system behavior and network traffic suggest a security breach, possibly through a compromised service account.

Solution Steps:
- Immediate Account Suspension: Temporarily disable the suspected compromised account to halt any unauthorized activities.
- Audit Logs: Analyze authentication logs (`/var/log/auth.log`), service-specific logs, and network traffic logs to trace the origin of the breach and assess the extent of access.
- Password and Key Review: Change passwords and SSH keys for affected accounts and any accounts with similar access levels. Review SSH authorized keys for unauthorized entries.
- Forensic Analysis: Use forensic tools to analyze system changes, including unexpected files, modified binaries, or rootkits.
- Post-Incident Review: After resolving the immediate threat, conduct a thorough review to understand the breach's cause. Implement enhanced security measures, such as two-factor authentication and stricter access controls.
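The forensic step can start with two cheap sweeps: recently modified files and setuid/setgid binaries. A sketch — `/usr/local` is an example scope, and a real investigation would widen it and add dedicated tooling (e.g., chkrootkit, rkhunter):

```shell
#!/bin/sh
# Sketch: a first-pass audit sweep of one directory tree.
audit_tree() {
  root="${1:?usage: audit_tree <dir>}"
  echo "== files modified in the last 2 days =="
  find "$root" -xdev -type f -mtime -2 2>/dev/null | head -20
  echo "== setuid/setgid files =="
  find "$root" -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null
}
# example scope; widen for a real investigation
audit_tree /usr/local
```

Anything unexpected in the setuid list, or freshly modified binaries under system paths, warrants comparison against package manager checksums (e.g., `rpm -Va` or `debsums`).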
Scenario 27: Network File System (NFS) Performance Degradation

Problem: Users report slow access times and poor performance when accessing files over an NFS mount.

Solution Steps:
- Network Diagnostics: Check the network performance between the NFS client and server using `ping`, `iperf3`, or `mtr` to identify any latency or packet loss issues.
- Server Load Monitoring: Monitor the NFS server's resource usage (CPU, memory, disk I/O) to identify bottlenecks.
- NFS Version and Options: Ensure both NFS server and client are using an optimal NFS version and mount options (`rsize`, `wsize`, `noatime`) for the workload.
- Concurrency and Locking Issues: Investigate if performance issues are due to high concurrency or file locking conflicts. Adjust NFS server configurations to handle higher loads more efficiently.
- Client-Side Caching: Implement or optimize client-side caching mechanisms to reduce load on the NFS server and improve performance for frequently accessed files.
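The version-and-options step usually ends up as an `/etc/fstab` entry. An illustrative line — the server name, export path, and 1 MiB transfer sizes are example values to tune against your own measurements:

```
# /etc/fstab (fragment) -- example NFSv4 mount with tuned options
nfsserver:/export/data  /mnt/data  nfs4  rsize=1048576,wsize=1048576,noatime,hard  0  0
```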
Scenario 28: Critical Security Vulnerability Patching

Incident: A critical security vulnerability is identified in a software component used widely across your infrastructure.

Change Management Steps:
- Risk Assessment: Immediately assess the vulnerability's impact on your environment.
- Patch Testing: In a controlled testing environment, apply the patch to the affected software to ensure it doesn't introduce new issues.
- Change Approval: Document the change and obtain approval from the Change Advisory Board (CAB) if required, emphasizing the urgency due to the security implications.
- Scheduled Patching: Schedule and announce a maintenance window to apply the patch, minimizing operational impact.
- Implementation and Monitoring: Apply the patch following the approved change plan and closely monitor the system for any unexpected behavior.

Problem Management Steps:
- Root Cause Analysis: Investigate how the vulnerable component was introduced and why the vulnerability wasn't detected earlier.
- Process Improvement: Based on the root cause analysis, update security review and patch management processes to prevent similar issues.
- Knowledge Sharing: Document the incident, actions taken, and lessons learned to improve organizational knowledge and preparedness for future vulnerabilities.
Scenario 29: Database Outage Due to Failed Replication

Incident: Database replication failure leads to an outage, affecting applications relying on the database for real-time data.

Change Management Steps:
- Immediate Diagnosis: Quickly identify and isolate the cause of the replication failure.
- Change Planning: Develop a plan to restore replication, including data synchronization without causing data loss or corruption.
- Change Implementation: Execute the change plan to restore database operations, ensuring all stakeholders are informed about the potential impact during the restoration process.
- Post-Implementation Review: After the change is implemented, verify database integrity and replication functionality.

Problem Management Steps:
- Identify Underlying Causes: Conduct a thorough analysis to determine why replication failed, including hardware, software, and configuration aspects.
- Corrective Actions: Based on the analysis, implement corrective actions to prevent recurrence, which may include hardware upgrades, configuration adjustments, or improved monitoring.
- Documentation and Training: Update documentation and conduct training sessions as needed to ensure the team is prepared to handle similar issues more effectively in the future.
Scenario 30: Service Downtime Caused by Configuration Error

Incident: A routine update to application configuration files results in unexpected downtime for a critical service.

Change Management Steps:
- Incident Identification and Rollback: Quickly roll back the configuration change to restore service functionality.
- Review and Approve Corrective Change: Analyze the failed change to understand the error and develop a corrected configuration. Review and approve this change following standard procedures.
- Implement and Monitor: Apply the corrected configuration during a defined maintenance window and closely monitor the service for stability and performance issues.

Problem Management Steps:
- Root Cause Analysis: Investigate the change process to understand how the configuration error was introduced and why it wasn't caught in testing.
- Process Improvement: Refine change management procedures, enhance validation checks for configuration changes, and improve testing protocols to catch similar errors.
- Education and Prevention: Share detailed findings and new procedures with the team to prevent similar incidents, emphasizing the importance of thorough testing and review for all changes.
Scenario 31: Intermittent Network Connectivity Issues

Incident: Users report intermittent network connectivity issues, affecting access to multiple services.

Change Management Steps:
- Immediate Investigation: Employ network diagnostic tools to identify potential causes, such as faulty hardware or configuration errors.
- Change Planning: Plan necessary changes to network infrastructure or configuration to address identified issues.
- Staged Implementation: Implement changes in a staged approach, if possible, to minimize impact, starting with non-production environments.

Problem Management Steps:
- Detailed Analysis: Conduct a comprehensive analysis of network logs, configurations, and hardware to identify the underlying problem causing the connectivity issues.
- Long-term Solutions: Based on the analysis, develop long-term solutions, which may include hardware replacements, topology changes, or enhanced monitoring and alerting capabilities.
- Review and Documentation: Document the incident's details, the analysis process, and the steps taken to resolve the issue, updating network design and maintenance guidelines to incorporate lessons learned.