SRE/DevOps Interview Questions
I’ve been interviewing a lot of candidates for a role on our team recently, so I decided to write down some of the questions and tasks I normally use to gauge an interviewee’s skill level.
These are basic questions and the list is not finished; I haven’t had time to come up with better ones yet.
The questions cover the following topics:
- Linux
- Package Management
- Troubleshooting Daemon
- Shell Scripting
- AWS
- Terraform
- Docker
- Kubernetes
- Config Management Tools
- CI/CD
- Secret Creds Best Practice
- Monitoring
Linux
CLI
- Unmounting a directory fails because it’s busy. How do you find out which PID is holding it?
$ sudo lsof /mount_point
- What is the command used to show the routing table on a Linux box?
$ route -n
$ netstat -r
- Find all the files in /tmp that were modified in the last 5 days and sort them from oldest to newest.
$ find /tmp -type f -mtime -5 -exec ls -lhtr {} +
- Setup a simple server/client model communicating on port 8080.
- Server
$ nc -l 8080
- Client
$ nc $__SERVER 8080
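- A careless sysadmin executes the following command (against /bin/chmod):
$ chmod 444 chmod
- What do you do to fix this? Any one of the following works (the loader path varies by architecture, e.g. /lib64/ld-linux-x86-64.so.2 on x86_64):
# /lib/ld-linux.so.2 /bin/chmod 755 /bin/chmod
# install -m 755 /bin/chmod /tmp/chmod && /tmp/chmod 755 /bin/chmod
# cp /bin/touch /bin/changeme && cat /bin/chmod > /bin/changeme && /bin/changeme 755 /bin/chmod
# ruby -e "File.chmod(0755,'/bin/chmod')"
# python3 -c "import os; os.chmod('/bin/chmod', 0o755)"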
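For the find question above, note that ls -t only sorts within a single invocation, so if find splits a long file list across several ls runs the overall order is not guaranteed. A minimal sketch that sorts on the modification timestamp instead, assuming GNU find (for -printf) and GNU touch in the demo. list_recent is a hypothetical helper name, and file names containing newlines are not handled:

```shell
# list_recent DIR DAYS: regular files under DIR modified less than DAYS days ago,
# printed oldest first. Assumes GNU find for -printf.
list_recent() {
  find "$1" -type f -mtime "-$2" -printf '%T@ %p\n' | sort -n | cut -d' ' -f2-
}

# Demonstration on a scratch directory rather than /tmp itself.
demo=$(mktemp -d)
touch -d '3 days ago'  "$demo/old.log"
touch -d '1 day ago'   "$demo/new.log"
touch -d '10 days ago' "$demo/stale.log"   # outside the 5-day window
list_recent "$demo" 5
```

The demo prints old.log before new.log and leaves out stale.log.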
Fire-fighting
My Elasticsearch cluster is unhealthy. We need to go through the Elasticsearch config file on every host, update an attribute, and restart the service ASAP.
Setup
- Hosts: elastic10 to elastic20
- File: /etc/sysconfig/elasticsearch
- Attribute: ES_HEAP_SIZE=<number>g
Task
Write a script or one-liner to log in to each host and check whether the attribute is 15g.
- If attribute is NOT 15g, change to 15g
- Restart the service
Solution
$ for i in elastic{10..20}; do echo "$i ..."; ssh -n $i "grep '^ES_HEAP_SIZE=15g' /etc/sysconfig/elasticsearch || sudo sed -i '/^ES_HEAP_SIZE=/s/=.*g/=15g/' /etc/sysconfig/elasticsearch && sudo service elasticsearch restart"; done
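The check-and-fix step of the one-liner can also be written as a small function, which is easier to read under pressure and makes it possible to restart only when something changed. This is a sketch, not the only way to do it: fix_heap_size is a hypothetical name, GNU sed is assumed for -i, and the demo runs against a scratch file instead of a real host.

```shell
# fix_heap_size FILE [SIZE]: make sure FILE contains ES_HEAP_SIZE=SIZE (default 15g).
# Returns 1 when the value is already correct, otherwise rewrites the line
# and returns sed's exit status. Assumes GNU sed for in-place editing (-i).
fix_heap_size() {
  file=$1
  want=${2:-15g}
  if grep -q "^ES_HEAP_SIZE=${want}\$" "$file"; then
    return 1
  fi
  sed -i "s/^ES_HEAP_SIZE=.*/ES_HEAP_SIZE=${want}/" "$file"
}

# Local demonstration on a scratch file instead of a real host.
conf=$(mktemp)
echo 'ES_HEAP_SIZE=8g' > "$conf"
if fix_heap_size "$conf"; then
  echo "changed, the service would be restarted here"
fi
cat "$conf"
```

On a real host the echo would be replaced by the service restart from the one-liner.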
Nginx
Describe how Nginx works with a load balancer and an application.
Package Management
- List all the installed packages on a host.
- RPM
$ rpm -qa
- APT
$ dpkg -l
- List all the files of a given package.
- RPM
$ rpm -ql bind-utils
- APT
$ dpkg -L dnsutils
- Find out which package supplies a file I’m looking for. Example: /usr/bin/nslookup
- RPM
$ rpm -qf /usr/bin/nslookup
$ yum whatprovides /usr/bin/nslookup
- APT
$ dpkg -S /usr/bin/nslookup
Troubleshooting Daemon
Issue: General MySQL
The MySQL server is not starting up. Walk me through the troubleshooting process.
Solution
- Check the mysqld log in /var/log.
- Make sure the owner of the mysql directory is correct.
- Ensure the socket file exists.
Issue: Port 3306
MySQL seems unable to bind to port 3306. What are the steps to fix the issue?
Solution
- Ensure there’s not another mysql service running.
$ ps -ef | grep mysql
- Check port 3306 via netstat or lsof.
$ sudo netstat -naptu | grep 3306 | grep LISTEN
$ sudo lsof -i :3306
- Once we know which process is using the port, we can kill the PID and restart mysql.
Shell Scripting
I have a bunch of elasticsearch-*.yml files in /tmp in the following format:
indices.breaker.fielddata.limit: 30%
indices.breaker.request.limit: x%
indices.breaker.total.limit: 75%
Task
- Write a script to update all of them
- If indices.breaker.request.limit is greater than 50%, replace the value with 50%
- All the changes should go to STDOUT
Solution
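for i in /tmp/elasticsearch-*.yml; do
  # Extract the numeric part of the request limit, e.g. "60" from "60%".
  request=$(grep "^indices.breaker.request.limit:" "$i" | awk '{print $2}' | tr -d '%')
  echo "$i: request: $request"
  # Only rewrite when the value is an integer greater than 50 (a literal "x" is skipped).
  if [ "$request" -gt 50 ] 2>/dev/null; then
    echo greater
    sed -i "/^indices.breaker.request.limit/s/${request}%\$/50%/" "$i"
  fi
done
echo "post run -"
grep indices.breaker.request.limit /tmp/elasticsearch-*.yml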
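One reading of "all the changes should go to STDOUT" is that the rewritten file content should be printed rather than written back. Under that assumption, a sketch with awk that leaves the files untouched could look like this. cap_request_limit is a hypothetical helper name, and only plain integer percentages are considered:

```shell
# cap_request_limit FILE: print FILE to STDOUT with indices.breaker.request.limit
# capped at 50%. The file on disk is not modified.
cap_request_limit() {
  awk '
    /^indices\.breaker\.request\.limit:/ {
      value = $2
      sub(/%$/, "", value)
      # Rewrite only when the value is a plain integer greater than 50.
      if (value ~ /^[0-9]+$/ && value + 0 > 50) {
        print "indices.breaker.request.limit: 50%"
        next
      }
    }
    { print }
  ' "$1"
}

# Demonstration on a scratch file in the format shown above.
sample=$(mktemp)
printf '%s\n' \
  'indices.breaker.fielddata.limit: 30%' \
  'indices.breaker.request.limit: 60%' \
  'indices.breaker.total.limit: 75%' > "$sample"
cap_request_limit "$sample"
```

Looping over /tmp/elasticsearch-*.yml and calling the function per file gives the same result as the solution above, minus the in-place edit.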
AWS
AWS-CLI
- Look up an EC2 instance.
$ aws ec2 describe-instances --instance-ids $_INSTANCE_ID
- Reboot an EC2 instance.
$ aws ec2 reboot-instances --instance-ids $_INSTANCE_ID
- Copy a directory from an S3 bucket to /tmp.
$ aws s3 cp --recursive s3://$_BUCKET_ID/directory /tmp/directory
ECS
- What are the main components of an ECS Cluster?
- Task Definitions
- Services
- Cluster Instances
Terraform
- What are modules in Terraform?
- Self-contained packages of Terraform configurations that are managed as a group. Modules are used to create reusable components in Terraform as well as for basic code organization.
- What does the backend.tf file do, and what are the benefits?
- A “backend” determines how state is loaded and how an operation such as apply is executed. This abstraction enables non-local file state storage, remote execution, etc.
- Benefits are:
- State locking to prevent corruption.
- Remote operations.
- How do you bring resources that were created by some other means under Terraform management?
$ terraform import
- Describe the typical “resources” required to set up a simple EC2 cluster.
- aws_iam_role
- aws_iam_role_policy
- aws_iam_instance_profile
- aws_security_group
- aws_instance
- What is the potential issue with the following resource?
resource "aws_ecs_service" "foo" {
  name = "foo"
  cluster = "${aws_ecs_cluster.foo.id}"
  task_definition = "${aws_ecs_task_definition.foo.arn}"
  desired_count = 1
  deployment_minimum_healthy_percent = 50
  deployment_maximum_percent = 100
}
- With desired_count = 1, the deployment_minimum_healthy_percent can’t be 50% of 1; it is rounded up to one whole task, so it is effectively 100%. With deployment_maximum_percent = 100 as well, ECS can neither stop the running task nor start a second one, so a deployment cannot make progress.
- Describe best practice for building infrastructure in multiple environments.
├── environments/
│   ├── dev-us-east-1/
│   ├── ...
│   ├── prd-us-east-1/
│   ├── stg-eu-central-1/
│   ├── stg-eu-central-1-networking/
│   └── stg-us-east-1/
└── modules/
    ├── aws_cloudfront/
    ├── bastion/
    ├── consul/
    ├── ...
    └── vpc/
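For the aws_ecs_service question above, the arithmetic can be checked with a few lines of shell. This is a sketch under one assumption: that ECS rounds the minimum healthy count up and the maximum running count down, which is how I understand the rolling-update documentation. The function names are made up for illustration.

```shell
# Assumed rounding rules for ECS rolling updates:
#   minimum healthy tasks = ceil(desired * min_percent / 100)
#   maximum running tasks = floor(desired * max_percent / 100)
min_healthy_tasks() { echo $(( ($1 * $2 + 99) / 100 )); }
max_running_tasks() { echo $(( $1 * $2 / 100 )); }

# can_roll DESIRED MIN_PCT MAX_PCT: succeeds if there is room to either stop an
# old task first or start a new task first.
can_roll() {
  lo=$(min_healthy_tasks "$1" "$2")
  hi=$(max_running_tasks "$1" "$3")
  [ "$lo" -lt "$1" ] || [ "$hi" -gt "$1" ]
}

can_roll 1 50 100 && echo "ok" || echo "stuck"   # the resource from the question: stuck
can_roll 1 50 200 && echo "ok" || echo "stuck"   # room to start a second task: ok
can_roll 1 0 100  && echo "ok" || echo "stuck"   # allowed to stop the only task: ok
```

Either raising deployment_maximum_percent or lowering deployment_minimum_healthy_percent unblocks the deployment; which one is acceptable depends on whether brief downtime is tolerable.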
Docker
Dockerfile
- Describe the difference between COPY and ADD in a Dockerfile.
- COPY takes a source and a destination. It only lets you copy a local file or directory from your host (the machine building the Docker image) into the Docker image itself.
- ADD lets you do that too, but it also supports 2 other sources. First, you can use a URL instead of a local file/directory. Second, you can extract a tar file from the source directly into the destination.
- Describe the difference between RUN, CMD, and ENTRYPOINT in a Dockerfile.
- RUN happens at build time. When you build your Docker image, Docker will read in your RUN command and build it into your image as a separate image layer.
- CMD happens at run time. This will likely be calling some type of process, such as nginx, bash or whatever process your Docker image runs. This does not create a separate image layer.
- ENTRYPOINT also happens at run time. Arguments given after the image name on the command line replace CMD, not ENTRYPOINT; overriding ENTRYPOINT requires the --entrypoint flag.
Troubleshooting disk usage
One of the containers on the host is using up the entire disk, find out the rogue container and stop/kill it.
Solution
- Find the culprit directory in /var/lib/docker/overlay2.
# du -sh * | grep G
- Inspect all the running docker containers and compare to the directory above.
# docker ps | awk '{print $1}' | grep -v CONTAINER | while read line; do echo $line; docker inspect $line | grep WorkDir | grep $__DIR__; done
- Once we have the container that is using up all the disk space, kill/stop and remove the container.
# docker kill $__CONTAINER; docker rm $__CONTAINER; df -h
Troubleshooting container
- Obtain shell access to a running container.
# docker exec -it $__CONTAINER bash
- Display the logs from the container.
# docker logs --follow $__CONTAINER
- Properly override the ENTRYPOINT using docker run.
$ docker run --entrypoint "/bin/ls" $__CONTAINER -alh /root
- Measure your containers’ resources.
# docker stats
- Remove unused data (stopped containers/dangling images/build cache/etc).
# docker system prune
docker-compose
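How do the services in the docker-compose.yml file talk to each other?
Solution
Compose puts the services on a shared default network, where each one is reachable by its service name as a hostname. Dependencies between services (startup order) are declared under depends_on: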
depends_on:
- elasticsearch172
- mongo32
- redis26
Docker Swarm
- List all the nodes in the swarm cluster.
# docker node ls
- Start 5 instances of a service.
# docker service create --replicas 5 --name $__NAME $__CONTAINER
- Scale the service to 20 instances.
# docker service scale $__NAME=20
Kubernetes
- What is a pod?
- A Pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
- What is a deployment?
- A Deployment provides declarative updates for Pods and ReplicaSets.
- What is a stateful set?
- StatefulSet is the workload API object used to manage stateful applications.
- When to use Deployment vs StatefulSet?
- Give an example of a stateful service.
- MongoDB/ElasticSearch/Redis
- How to update an application without downtime?
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
- What is a namespace? When to use it?
- What is the difference between a Secret and a ConfigMap? When would you use each?
- Can you name some kubernetes commands?
- kubectl config get-contexts
- kubectl apply -f OBJECT.yaml
- kubectl get all,hpa,sa,ns,ing,sc,pvc,secrets -o wide
- kubectl describe service/SERVICE-NAME
- kubectl get deployment SERVICE-NAME -o yaml
- kubectl exec -it POD-NAME -- CMD
- kubectl logs -f POD-NAME
Config Management Tools
- Describe the main components of any of the CMS.
- Chef/Ansible/Puppet
- Invoke SSH commands (in parallel) on a subset of nodes.
- Chef
$ knife ssh $_SEARCH_QUERY $_SHELL_COMMAND
- Ansible
$ ansible $_CLUSTER -a $_SHELL_COMMAND -f 10
- Puppet
$ mco rpc shell start command=$_SHELL_COMMAND server=$_CLUSTER
- PDSH
$ WCOLL=$_NODES pdsh -f 5 $_SHELL_COMMAND
- How to speed up runtime.
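For the parallel SSH question above, a tool-agnostic fallback can be sketched with xargs. This is only an illustration: run_on_nodes and the RUNNER variable are hypothetical names, the concurrency of 5 mirrors the pdsh example, and an xargs that supports -P is assumed.

```shell
# run_on_nodes CMD NODE...: run CMD on every node, at most 5 at a time.
# RUNNER is a hypothetical knob: it defaults to ssh; set RUNNER=echo for a dry run.
run_on_nodes() {
  cmd=$1
  shift
  printf '%s\n' "$@" | xargs -P 5 -I{} "${RUNNER:-ssh}" {} "$cmd"
}

# Dry run: prints the host and command that would be executed (order may vary).
RUNNER=echo run_on_nodes 'uptime' elastic10 elastic11 elastic12
```

Unlike knife, ansible or mco, this does not know about inventory or search queries, so the node list has to come from somewhere else.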
CI/CD
- Jenkins
- What are the various ways in which build can be scheduled in Jenkins?
- By source code management commits.
- After completion of other builds.
- Can be scheduled to run at specified time (crons).
- Manual Build Requests.
- Codeship
- What are the two yml files required for Codeship to work?
codeship-services.yml
codeship-steps.yml
- What file is used to encrypt/decrypt content for a project?
codeship.aes
- What is the command to trigger a codeship job locally?
$ jet steps
- Deployment procedures
- What is a Canary deployment?
- Rolling out releases to a subset of users or servers. The idea is to first deploy the change to a small subset of servers, test it, and then roll the change out to the rest of the servers.
- What is a Blue/Green deployment?
- A technique that reduces downtime and risk by running two identical production environments. At any time, only one of the environments is live, with the live environment serving all production traffic. For example, Blue is currently live and Green is idle.
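The Blue/Green idea can be illustrated on a single host with a symlink that points at whichever environment is live. In practice the switch usually happens at a load balancer or DNS level; the directory layout and the switch_live helper below are made up for this sketch.

```shell
# Sketch: blue/green switch on one host using a symlink.
# The layout (releases/blue, releases/green, live) is a made-up example.
root=$(mktemp -d)
mkdir -p "$root/releases/blue" "$root/releases/green"
echo "v1" > "$root/releases/blue/VERSION"
echo "v2" > "$root/releases/green/VERSION"

# Blue starts out live; green is idle and can be tested in isolation.
ln -sfn "$root/releases/blue" "$root/live"

# switch_live COLOR: repoint the live symlink at the given environment.
switch_live() {
  ln -sfn "$root/releases/$1" "$root/live"
}

switch_live green          # cut over to the new release
cat "$root/live/VERSION"   # prints v2
switch_live blue           # roll back
cat "$root/live/VERSION"   # prints v1
```

Rolling back is the same operation as cutting over, which is the main appeal of the technique.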
Secret Creds Best Practice
What is the best way to store secret creds for our application to load during runtime?
- What service/backend to use
- How to implement for multiple environments
Monitoring
Any experience with New Relic, Splunk, PagerDuty, Nagios, etc.?