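Introduction:
Hi all. In IT operations, monitoring server metrics such as CPU, memory, and disk or filesystem utilization is a very common task. When a metric crosses its critical threshold, though, someone has to log in to the server, run some basic troubleshooting, and find the initial cause of the high utilization. If that person receives the same alert many times, they have to repeat those steps every time, which is tedious and unproductive. As a workaround, we can build a system that reacts as soon as an alarm is triggered and acts on the affected instance by running a few basic troubleshooting commands. To summarize the problem statement and the expectations: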
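Problem statement:
Develop a system that meets the following expectations:
Every EC2 instance should be monitored by CloudWatch.
Once an alarm is triggered, something must log in to the affected EC2 instance and run some basic troubleshooting commands.
Then, create a Jira issue to record the incident and add the command output in the comments section.
Then, send an automated email with all the alarm details and the Jira issue details.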
Architecture diagram:
Prerequisites:
EC2 instances
CloudWatch alarms
EventBridge rule
Lambda function
Jira account
Simple Notification Service (SNS)
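Implementation steps:
a. CloudWatch agent installation and configuration:
Open the Systems Manager console and click "Documents".
Search for the "AWS-ConfigureAWSPackage" document and run it with the required details.
Package name = AmazonCloudWatchAgent
After installation, the CloudWatch agent must be configured from a configuration file. To do that, run the AmazonCloudWatch-ManageAgent document. Also make sure the JSON CloudWatch agent configuration file is stored in an SSM parameter.
Once you see the metrics being reported in the CloudWatch console, create alarms for CPU utilization, memory utilization, and so on.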
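The alarm-creation step can also be scripted. The following is a minimal sketch, not the author's setup: the alarm naming scheme, period, and evaluation settings are assumptions, and it assumes the agent publishes `mem_used_percent` in the `CWAgent` namespace with only an `InstanceId` dimension (alarm dimensions must match the published metric exactly). The builder returns keyword arguments for boto3's `put_metric_alarm`, so it can be inspected without AWS access.

```python
def build_utilization_alarm(instance_id, metric_name, namespace, threshold=80.0):
    """Return keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{metric_name}-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # assumed: 5-minute evaluation window
        "EvaluationPeriods": 1,    # assumed: alarm after one breaching period
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def create_alarm(params):
    """Create the alarm; boto3 is imported lazily so the builder runs without AWS access."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**params)


# The instance ID below is a placeholder.
cpu_alarm = build_utilization_alarm("i-0123456789abcdef0", "CPUUtilization", "AWS/EC2")
mem_alarm = build_utilization_alarm("i-0123456789abcdef0", "mem_used_percent", "CWAgent")
print(cpu_alarm["AlarmName"])
print(mem_alarm["AlarmName"])
```

Calling `create_alarm(cpu_alarm)` from an environment with CloudWatch permissions would create the alarm; the 80.0 default mirrors the threshold mentioned in the alert email.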
b. EventBridge rule setup:
To track alarm state changes, we customized the event pattern slightly so that it matches only transitions from OK to ALARM, not the reverse. Then add this rule to the Lambda function as a trigger.
    {
      "source": ["aws.cloudwatch"],
      "detail-type": ["CloudWatch Alarm State Change"],
      "detail": {
        "state": {
          "value": ["ALARM"]
        },
        "previousState": {
          "value": ["OK"]
        }
      }
    }
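To make the effect of this pattern concrete, here is a small sketch that mimics how such a pattern selects events. The `matches` helper is a deliberately simplified stand-in for EventBridge matching (exact values only, no prefix or numeric operators), and the sample events are made up for illustration.

```python
import json

# Event pattern with the casing EventBridge uses for CloudWatch alarm events.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "state": {"value": ["ALARM"]},
        "previousState": {"value": ["OK"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: a dict recurses into the event, a list means 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(previous, current):
    """Build a trimmed, made-up alarm state change event."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"state": {"value": current}, "previousState": {"value": previous}},
    }


print(matches(ALARM_PATTERN, sample_event("OK", "ALARM")))   # True
print(matches(ALARM_PATTERN, sample_event("ALARM", "OK")))   # False
print(json.dumps(ALARM_PATTERN))
```

The JSON string printed on the last line is the form that would be passed as `EventPattern` when creating the rule programmatically, for example with the boto3 `events` client's `put_rule`.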
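c. Create a Lambda function to send the email and log the incident in Jira:
This Lambda function handles several activities. It is triggered by the EventBridge rule, and it publishes to an SNS topic as its destination using the AWS SDK (boto3). Once the EventBridge rule fires, it sends the event as JSON to the Lambda function, which extracts several details from it and processes them in different ways.
So far we have worked on two types of alarms: i. CPU utilization and ii. memory utilization. Once either alarm is triggered and its state changes from OK to ALARM, the EventBridge rule fires, which in turn triggers the Lambda function to perform the tasks described in the code.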
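Lambda prerequisites:
We need to import the following modules for the code to work:
>> os
>> sys
>> json
>> boto3
>> time
>> requests
Note: Of the modules above, all except "requests" are available by default in the Lambda runtime. The "requests" module cannot be imported directly in Lambda, so first install it into a folder on your local machine (laptop) by running the following command:
    pip3 install requests -t <directory path> --no-user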
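After that, the module is downloaded into the folder from which you ran the command, or into the folder where you want to store the module source code. I am assuming here that the Lambda code is being prepared on your local machine. If so, create a zip file of the entire Lambda source code together with the module, then upload the zip file to the Lambda function.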
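The packaging step can be scripted as well. This is a sketch under assumptions: the function name and folder layout are hypothetical, and it assumes `requests` was installed into a `python` subfolder next to the handler file, which is what the `sys.path.append('./python')` line in the Lambda code suggests.

```python
import os
import zipfile


def zip_directory(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == os.path.abspath(zip_path):
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path
```

For example, `zip_directory("lambda_src", "function.zip")` would produce an archive with the handler at the top level and the module under `python/`, ready to upload to the Lambda function.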
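So, we are handling the following two scenarios here:
1. CPU utilization: If the CPU utilization alarm is triggered, the Lambda function needs to fetch the instance ID, log in to that instance, and list the top 5 CPU-consuming processes. It then creates a Jira issue and adds the process details in the comments section. At the same time, it sends an email with the alarm details, the Jira issue details, and the process output.
2. Memory utilization: Same approach as above.
Now, let me restate the tasks the Lambda function should perform:
Log in to the instance.
Perform basic troubleshooting steps.
Create a Jira issue.
Send an email with all the details to the recipients.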
Scenario 1: When the alarm state changes from OK to ALARM
First set (define the CPU and memory functions):
    ################# Importing Required Modules ################
    ############################################################
    import json
    import boto3
    import time
    import os
    import sys
    sys.path.append('./python')   ## Makes the bundled requests module and its dependencies importable
    import requests
    from requests.auth import HTTPBasicAuth

    ################## Calling AWS Services ###################
    ###########################################################
    ssm = boto3.client('ssm')
    sns_client = boto3.client('sns')
    ec2 = boto3.client('ec2')

    ################## Defining Blank Variable ################
    ###########################################################
    cpu_process_op = ''
    mem_process_op = ''
    issueid = ''
    issuekey = ''
    issuelink = ''

    ################# Function for CPU Utilization ################
    ###############################################################
    def cpu_utilization(instanceid, metric_name, previous_state, current_state):
        global cpu_process_op
        if previous_state == 'OK' and current_state == 'ALARM':
            # head -5 keeps the header line plus the first four processes, sorted by CPU
            command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -5'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Start a session
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Retrieve the command output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            cpu_process_op = output['StandardOutputContent']
        else:
            print('None')

    ################# Function for Memory Utilization ################
    ###############################################################
    def mem_utilization(instanceid, metric_name, previous_state, current_state):
        global mem_process_op
        if previous_state == 'OK' and current_state == 'ALARM':
            # Same listing as above, sorted by memory instead of CPU
            command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -5'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Start a session
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Retrieve the command output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            mem_process_op = output['StandardOutputContent']
        else:
            print('None')
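The functions above wait a fixed four seconds before reading the command output, which may be too short for a slow instance. One possible alternative, sketched here and not part of the original code, is to poll `get_command_invocation` until the command reaches a terminal status. The helper name, timeout, and interval are assumptions; the client is passed in as a parameter so the helper can be exercised with a stub.

```python
import time


def wait_for_command(ssm_client, command_id, instance_id, timeout=30.0, interval=2.0):
    """Poll get_command_invocation until the command reaches a terminal status."""
    terminal = {"Success", "Failed", "Cancelled", "TimedOut"}
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = ssm_client.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id
            )
            if result["Status"] in terminal:
                return result
        except ssm_client.exceptions.InvocationDoesNotExist:
            pass  # the invocation may not be registered yet right after send_command
        if time.monotonic() >= deadline:
            raise TimeoutError(f"command {command_id} did not finish in {timeout}s")
        time.sleep(interval)
```

In the Lambda code this would replace the `time.sleep(4)` and `get_command_invocation` pair, for example `output = wait_for_command(ssm, command_id, instanceid)`. The Lambda timeout would need to be at least as long as the polling timeout.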
Second set (create the Jira issue):
    ################## Create JIRA Issue ################
    #####################################################
    def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
        ## Create Issue ##
        url = 'https://<your-user-name>.atlassian.net/rest/api/2/issue'
        username = os.environ['username']
        api_token = os.environ['token']
        project = 'AnirbanSpace'
        issue_type = 'Incident'
        assignee = os.environ['username']
        summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
        summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
        description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'

        issue_data = {
            "fields": {
                "project": {
                    "key": "SCRUM"
                },
                "summary": summary,
                "description": description,
                "issuetype": {
                    "name": issue_type
                },
                "assignee": {
                    "name": assignee
                }
            }
        }
        data = json.dumps(issue_data)
        headers = {
            "Accept": "application/json",
            "Content-Type": "application/json"
        }
        auth = HTTPBasicAuth(username, api_token)
        response = requests.post(url, headers=headers, auth=auth, data=data)

        # Keep the new issue's identifiers in module-level variables for send_email
        global issueid
        global issuekey
        global issuelink
        issueid = response.json().get('id')
        issuekey = response.json().get('key')
        issuelink = response.json().get('self')

        ################ Add Comment To Above Created JIRA Issue ###################
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        comment_api_url = f"{url}/{issuekey}/comment"
        add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))

        ## Check the response of the issue-creation request
        if response.status_code == 201:
            print("Issue created successfully. Issue key:", response.json().get('key'))
        else:
            print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")
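One thing worth noting in the function above is that the comment is posted before the creation response is checked, so a failed creation would still attempt a comment against a missing issue key. A possible restructuring, sketched with hypothetical helper names, separates payload building from response handling. The `accountId` form of the assignee is an assumption about Jira Cloud, which to my understanding identifies users by account ID rather than by name; verify this against your Jira instance.

```python
def build_issue_payload(project_key, issue_type, summary, description, assignee_account_id=None):
    """Build the JSON body for the Jira create-issue endpoint."""
    fields = {
        "project": {"key": project_key},
        "summary": summary,
        "description": description,
        "issuetype": {"name": issue_type},
    }
    if assignee_account_id:
        # Assumption: Jira Cloud expects an accountId here instead of a user name.
        fields["assignee"] = {"accountId": assignee_account_id}
    return {"fields": fields}


def created_issue_key(status_code, body):
    """Return the new issue key, or None unless the API answered 201 Created."""
    if status_code != 201:
        return None
    return body.get("key")


def comment_url(issue_api_url, issue_key):
    """Build the comment endpoint for an existing issue."""
    return f"{issue_api_url.rstrip('/')}/{issue_key}/comment"
```

With these helpers, the comment request would be sent only when `created_issue_key(response.status_code, response.json())` returns a key.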
Third set (send the email):
    ################## Send An Email ################
    #################################################
    def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
        ### Define a dictionary of custom input ###
        metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

        ### Conditions ###
        if previous_state == 'OK' and current_state == 'ALARM' and metric_name in list(metric_list.keys()):
            metric_msg = metric_list[metric_name]
            output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
            print('This is output', output)
            email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization is high for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\nProcessOutput: \n{output}\nIncident Details:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team"
            # Publish to the SNS topic whose ARN is stored in the 'snsarn' environment variable
            res = sns_client.publish(
                TopicArn = os.environ['snsarn'],
                Subject = f'High {metric_msg} Utilization Alert : {instanceid}',
                Message = str(email_body)
            )
            print('Mail has been sent') if res else print('Email not sent')
        else:
            email_body = str(0)
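SNS places restrictions on the email subject: it must be a single line of ASCII text shorter than 100 characters. The subject built above is short enough for an instance ID, but if more fields were added it could exceed the limit. A defensive sketch (the helper name is hypothetical) might look like this:

```python
def build_subject(metric_msg, instance_id, max_len=99):
    """Build an SNS email subject: single line, ASCII only, under 100 characters."""
    subject = f"High {metric_msg} Utilization Alert : {instance_id}"
    subject = subject.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    subject = " ".join(subject.split())                          # collapse line breaks and extra spaces
    return subject[:max_len]
```

The result would be passed as the `Subject` argument of `sns_client.publish`.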
Fourth set (the Lambda handler function):
    ################## Lambda Handler Function ################
    ###########################################################
    def lambda_handler(event, context):
        # Pull the alarm details out of the EventBridge event
        instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
        metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
        account = event['account']
        timestamp = event['time']
        region = event['region']
        current_state = event['detail']['state']['value']
        current_reason = event['detail']['state']['reason']
        previous_state = event['detail']['previousState']['value']
        previous_reason = event['detail']['previousState']['reason']
        # reasonData is a JSON string, so it has to be parsed before reading the datapoint
        metric_val = json.loads(event['detail']['state']['reasonData'])['evaluatedDatapoints'][0]['value']

        ##### function calling #####
        if metric_name == 'CPUUtilization':
            cpu_utilization(instanceid, metric_name, previous_state, current_state)
            create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
            send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
        elif metric_name == 'mem_used_percent':
            mem_utilization(instanceid, metric_name, previous_state, current_state)
            create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
            send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
        else:
            None
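To see which parts of the event the handler relies on, it can help to run the extraction against a sample payload. The event below is a trimmed, made-up example (account, time, reason text, and values are placeholders), and `parse_alarm_event` is a hypothetical helper that also tolerates a missing `reasonData` field instead of raising.

```python
import json

# Trimmed, invented sample of a CloudWatch alarm state change event.
SAMPLE_EVENT = {
    "account": "123456789012",
    "time": "2024-01-01T00:00:00Z",
    "region": "us-east-1",
    "detail": {
        "state": {
            "value": "ALARM",
            "reason": "Threshold Crossed",
            "reasonData": json.dumps({"evaluatedDatapoints": [{"value": 91.5}]}),
        },
        "previousState": {"value": "OK", "reason": "placeholder"},
        "configuration": {
            "metrics": [
                {
                    "metricStat": {
                        "metric": {
                            "name": "CPUUtilization",
                            "dimensions": {"InstanceId": "i-0123456789abcdef0"},
                        }
                    }
                }
            ]
        },
    },
}


def parse_alarm_event(event):
    """Extract the fields the handler needs; metric_val falls back to 'NA'."""
    metric = event["detail"]["configuration"]["metrics"][0]["metricStat"]["metric"]
    state = event["detail"]["state"]
    # reasonData is a JSON string; it may be missing or hold no datapoints
    reason_data = json.loads(state.get("reasonData") or "{}")
    datapoints = reason_data.get("evaluatedDatapoints") or []
    return {
        "instanceid": metric["dimensions"]["InstanceId"],
        "metric_name": metric["name"],
        "current_state": state["value"],
        "previous_state": event["detail"]["previousState"]["value"],
        "metric_val": datapoints[0].get("value", "NA") if datapoints else "NA",
    }


parsed = parse_alarm_event(SAMPLE_EVENT)
print(parsed["instanceid"], parsed["metric_name"], parsed["metric_val"])
```

Saving an event like this as a Lambda test event is a convenient way to exercise the handler without waiting for a real alarm.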
Alarm email screenshot:
Note: Ideally the threshold would be 80%, but for testing I changed it to 10%. See the reason shown in the alert.
Alarm Jira issue:
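Scenario 2: When the alarm state changes from OK to INSUFFICIENT_DATA
In this scenario, if no CPU or memory utilization metric data is captured for a server, the alarm state changes from OK to INSUFFICIENT_DATA. This state can be reached in two ways: a.) the server is in the stopped state, or b.) the CloudWatch agent is not running or has gone into a dead state.
So, as the script below shows, when a CPU or memory utilization alarm goes into the INSUFFICIENT_DATA state, the Lambda function first checks whether the instance is running. If it is, the function logs in and checks the CloudWatch agent status. After that, it creates a Jira issue and posts the agent status in the issue's comments section. Finally, it sends an email with the alarm details and the agent status.
Full code: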
    ################# Importing Required Modules ################
    ############################################################
    import json
    import boto3
    import time
    import os
    import sys
    sys.path.append('./python')   ## Makes the bundled requests module and its dependencies importable
    import requests
    from requests.auth import HTTPBasicAuth

    ################## Calling AWS Services ###################
    ###########################################################
    ssm = boto3.client('ssm')
    sns_client = boto3.client('sns')
    ec2 = boto3.client('ec2')

    ################## Defining Blank Variable ################
    ###########################################################
    cpu_process_op = ''
    mem_process_op = ''
    issueid = ''
    issuekey = ''
    issuelink = ''

    ################# Function for CPU Utilization ################
    ###############################################################
    def cpu_utilization(instanceid, metric_name, previous_state, current_state):
        global cpu_process_op
        if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
            # IncludeAllInstances=True is needed so that stopped instances are returned too;
            # by default only running instances appear in InstanceStatuses
            ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,], IncludeAllInstances=True)['InstanceStatuses'][0]['InstanceState']['Name']
            if ec2_status == 'running':
                # Report the agent status, then restart the agent
                command = 'systemctl status amazon-cloudwatch-agent;sleep 3;systemctl restart amazon-cloudwatch-agent'
                print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
                # Start a session
                print(f'Starting session to {instanceid}')
                response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
                command_id = response['Command']['CommandId']
                print(f'Command ID: {command_id}')
                # Retrieve the command output
                time.sleep(4)
                output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
                print('Please find below output -\n', output['StandardOutputContent'])
                cpu_process_op = output['StandardOutputContent']
            else:
                cpu_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
                print(f'Instance current status is {ec2_status}. Not able to reach out!!')
        else:
            print('None')

    ################# Function for Memory Utilization ################
    ###############################################################
    def mem_utilization(instanceid, metric_name, previous_state, current_state):
        global mem_process_op
        if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
            # IncludeAllInstances=True is needed so that stopped instances are returned too
            ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,], IncludeAllInstances=True)['InstanceStatuses'][0]['InstanceState']['Name']
            if ec2_status == 'running':
                command = 'systemctl status amazon-cloudwatch-agent'
                print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
                # Start a session
                print(f'Starting session to {instanceid}')
                response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
                command_id = response['Command']['CommandId']
                print(f'Command ID: {command_id}')
                # Retrieve the command output
                time.sleep(4)
                output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
                print('Please find below output -\n', output['StandardOutputContent'])
                mem_process_op = output['StandardOutputContent']
                print(mem_process_op)
            else:
                mem_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
                print(f'Instance current status is {ec2_status}. Not able to reach out!!')
        else:
            print('None')

    ################## Create JIRA Issue ################
    #####################################################
    def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
        ## Create Issue ##
        url = 'https://<your-user-name>.atlassian.net/rest/api/2/issue'
        username = os.environ['username']
        api_token = os.environ['token']
        project = 'AnirbanSpace'
        issue_type = 'Incident'
        assignee = os.environ['username']
        summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
        summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
        description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'

        issue_data = {
            "fields": {
                "project": {
                    "key": "SCRUM"
                },
                "summary": summary,
                "description": description,
                "issuetype": {
                    "name": issue_type
                },
                "assignee": {
                    "name": assignee
                }
            }
        }
        data = json.dumps(issue_data)
        headers = {
            "Accept": "application/json",
            "Content-Type": "application/json"
        }
        auth = HTTPBasicAuth(username, api_token)
        response = requests.post(url, headers=headers, auth=auth, data=data)

        # Keep the new issue's identifiers in module-level variables for send_email
        global issueid
        global issuekey
        global issuelink
        issueid = response.json().get('id')
        issuekey = response.json().get('key')
        issuelink = response.json().get('self')

        ################ Add Comment To Above Created JIRA Issue ###################
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        comment_api_url = f"{url}/{issuekey}/comment"
        add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))

        ## Check the response of the issue-creation request
        if response.status_code == 201:
            print("Issue created successfully. Issue key:", response.json().get('key'))
        else:
            print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")

    ################## Send An Email ################
    #################################################
    def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
        ### Define a dictionary of custom input ###
        metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

        ### Conditions ###
        if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA' and metric_name in list(metric_list.keys()):
            metric_msg = metric_list[metric_name]
            output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
            email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization alarm state has been changed to {current_state} for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\nProcessOutput = \n{output}\nIncident Details:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team"
            res = sns_client.publish(
                TopicArn = os.environ['snsarn'],
                Subject = f'Insufficient {metric_msg} Utilization Alarm : {instanceid}',
                Message = str(email_body)
            )
            print('Mail has been sent') if res else print('Email not sent')
        else:
            email_body = str(0)

    ################## Lambda Handler Function ################
    ###########################################################
    def lambda_handler(event, context):
        instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
        metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
        account = event['account']
        timestamp = event['time']
        region = event['region']
        current_state = event['detail']['state']['value']
        current_reason = event['detail']['state']['reason']
        previous_state = event['detail']['previousState']['value']
        previous_reason = event['detail']['previousState']['reason']
        # No datapoint is available in the INSUFFICIENT_DATA state
        metric_val = 'NA'

        ##### function calling #####
        if metric_name == 'CPUUtilization':
            cpu_utilization(instanceid, metric_name, previous_state, current_state)
            create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
            send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
        elif metric_name == 'mem_used_percent':
            mem_utilization(instanceid, metric_name, previous_state, current_state)
            create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
            send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
        else:
            None
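The instance-state lookup is the step most likely to fail in this scenario, because `describe_instance_status` returns only running instances unless `IncludeAllInstances=True` is passed, and indexing an empty `InstanceStatuses` list raises an error. A guarded version might look like the sketch below; the helper name and the `'unknown'` fallback are assumptions, and the client is passed in so the helper can be tried with a stub.

```python
def get_instance_state(ec2_client, instance_id):
    """Return the instance state name, or 'unknown' if EC2 reports no status entry."""
    response = ec2_client.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,  # without this flag only running instances are returned
    )
    statuses = response.get("InstanceStatuses", [])
    if not statuses:
        return "unknown"
    return statuses[0]["InstanceState"]["Name"]
```

In the functions above, `ec2_status = get_instance_state(ec2, instanceid)` could then feed the existing `if ec2_status == 'running':` check.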
INSUFFICIENT_DATA email screenshot:
INSUFFICIENT_DATA Jira issue:
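Conclusion:
In this article we tested scenarios for CPU and memory utilization, but the same automated-incident and automated-email functionality can be configured for many other metrics, which would save a lot of effort on monitoring and incident creation. This solution is an initial approach to build on, and there are certainly other ways to achieve the same goal. I hope you can all see how we tried to tie these pieces together. If you liked this article or have any suggestions, please like and comment so that we can cover them in upcoming articles.
Thanks!!
Anirban Das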