5 May 2025

Senior Officer, Site Reliability Engineering (40001670)

Category:  Technology Division
Job Type: 
Facility:  Technology

Job Purpose

Responsible for daily monitoring of the IT infrastructure, applications, and services behind critical systems (T24, ROC, COC, CARD, etc.), ensuring that these services meet the SLAs committed to the business. The role also participates in handling alerts and incidents to restore services as quickly as possible, and resolves outstanding issues to ensure the best service delivery for customers.

Key Accountabilities (1)

Participate in monitoring and handling system alerts/incidents/problems:
- Perform 24/7 monitoring and handle alerts for all IT infrastructure/applications/services. If difficulties arise, escalate to L3 for coordinated handling.
- Ensure projects/specialized operations departments provide adequate alert/incident handling instructions for new services before go-live, and periodically review and update existing alert/incident handling instructions.
- Perform periodic reviews of issues/vulnerabilities in IT infrastructure/applications/services within the scope of responsibility.
- Participate in standardizing and developing relevant processes and regulations to ensure effective monitoring and handling of alerts/incidents.
- Coordinate with relevant units to promptly restore services/systems, investigate root causes, and propose and implement solutions.
- Participate in implementing changes across the Production environment, both on-premises and in the cloud.

Participate in building and optimizing centralized monitoring tools:
- Develop and publish standards for centralized monitoring tools (Dynatrace, Grafana, Splunk, etc.) and operate those tools.
- Integrate monitoring tools and support building monitoring charts for new IT infrastructure/applications/services.
- Ensure projects/specialized operations departments provide adequate monitoring indicators/monitoring thresholds for new services before go-live.

Key Accountabilities (2)

System problem and incident management:
- Manage the lifecycle of IT incidents, including identifying, triaging, coordinating and resolving incidents according to SLAs.
- Act as the point of contact during troubleshooting, ensuring effective communication between technical, operations and sales departments.
- Perform root cause analysis (RCA) after each incident, recommending preventive measures and process improvements. Coordinate with relevant teams to minimize downtime and improve system availability.
- Participate in developing and maintaining incident management processes according to standards and best practices.

Key Accountabilities (3)

Responsibilities in Risk Management and Compliance:
- Support controls and ensure that the unit's activities comply with issued policies, regulations, procedures and instructions.
- Identify the unit's operational risks and coordinate with relevant units to develop methods to measure, evaluate and minimize them.

Report periodically to management and perform other tasks as directed.

Key Relationships - Direct Manager

Director / Senior Manager / Manager of ITSE

Key Relationships - Direct Reports

N/A

Key Relationships - Internal Stakeholders

Departments in IT and business

Key Relationships - External Stakeholders

Partners providing professional services

Success Profile - Qualification and Experiences

Qualifications
- Bachelor's degree or higher in Finance, Economics, Banking, Business Administration, or Computer Science.

Experience
- At least 5 years of experience in IT development and operations at a large enterprise, preferably in banking.

Language Proficiency
- English proficiency as required by TCB regulations in effect from time to time.

Other Requirements
- International certification in Systems.
