Zuora, powering the Subscription Economy®, provides the only SaaS platform that automates all subscription order-to-cash operations in real-time for any business. Companies in any industry can launch new businesses, shift products to subscription, implement new pay-as-you-go pricing and packaging models, gain new insights into subscriber behavior, and disrupt market segments to gain competitive advantage. Zuora serves more than 900 companies around the world in a wide range of industries, including Box, Komatsu, Rogers, Schneider Electric, Toshiba, Xplornet and Zendesk. Headquartered in Silicon Valley, Zuora also operates offices in Atlanta, Boston, Denver, San Francisco, London, Paris, Beijing, Sydney, Chennai and Tokyo. To learn more about the Zuora platform, please visit zuora.com.
We are looking for a Site Reliability Engineer for our Chennai office.
- Be part of a global SRE team based in Chennai, India and San Jose, US.
- Improve and build upon our automation tools for systems provisioning, monitoring, trending, and management.
- Communicate effectively with fellow SREs and other engineering teams, and describe problems succinctly but with enough detail that you can hand off an ongoing problem to another team or a peer for completion.
- During a crisis, lead the effort to triage and mitigate
- Manage real-time communications during outages with both technical and non-technical audiences
- Perform periodic on-call duty as part of a global team maintaining the availability and performance of RevPro SaaS.
- Strategize with fellow SREs and other engineering teams on complex problems, and make decisions and recommendations about systems improvements after analyzing possible courses of action.
- Perform performance analysis, proactive troubleshooting, continual improvement and capacity planning for production virtualized environments
- Administer web servers, application servers and databases that run applications.
- Develop policies and procedures that improve overall platform stability.
- Participate in reviews of outages in order to improve overall product stability.
- Build relationships with development teams and technology leaders across the company
- 3-5 years of experience operating and scaling services in a distributed, internet-scale environment
- Strong experience with Oracle database; hands-on experience with Postgres or MySQL is a plus.
- Strong knowledge of Linux operating systems and environment.
- Experience with monitoring, trending, and logging tools such as Logstash/Elasticsearch/Kibana, CloudWatch, and Splunk.
- Experience with Virtualization/Amazon AWS.
- Solid scripting skills; experience with Shell, Python, PL/SQL, etc.
- Experience with the setup and configuration of tools such as Ansible, Terraform, Postfix, central logging (syslog-ng), SNMP and monitoring systems (e.g. Nagios, Ganglia, Cacti) and other reporting tools.
- Experience in handling production outages and root cause analysis
- Strong crisis management leadership ability; experience with incident management.
- Hands-on operational experience in a high-volume or critical production service environment
- Effective communication skills, whether talking to individual contributors or to executive management
- Ultimate self-starter
- Strong troubleshooting and problem resolution skills
- Experience creating tools for infrastructure (IaaS and PaaS) management and automation is a plus
- Experience with complex SaaS or production, revenue-critical web services environments is a strong plus.
- Experience with Unix/Linux system administration, especially in Red Hat Linux (CentOS) and Ubuntu environments
- Experience with environment configurations at network, OS and application levels
- Experience with environment monitoring in 24/7 web application and e-commerce environments
- Ability to use scripting languages to automate tasks and gather data
- Demonstrated ability to use problem-solving techniques such as root cause analysis to resolve issues
- Demonstrated ability to write and present effective materials, including presentations, status reporting, technical diagrams and flowcharts
- Ability to follow and adhere to policies, procedures and standards relating to systems management. May recommend process improvements.
- B.S. degree (required); M.S. degree or equivalent technical training
- Ability to handle periodic on-call duty