2022 Sessions

Keynote (SRE-167): IBM Client Panel

Description: Sunil hosts an IBM client panel, discussing the topic of ‘Adopting SRE & impact to the organisation’. The panel will discuss the business side of the transformation, as well as culture and people aspects for a successful adoption of SRE principles.

Keywords: Adoption, Enterprise

Speaker Bios:

  • Narayanan (KK) Krishnakumar, VP and CTO at Delta Air Lines. KK is an experienced technology leader and change agent. He combines extensive business and hands-on technical leadership experience over 20+ years delivering innovative business solutions. He has 10+ years of executive management experience as CTO/Chief Architect and specializes in building and leading business application programs, strategy and enterprise architecture programs, and technology and infrastructure groups. Intense, energetic and collaborative.
  • Kevin Yaniga – SVP Truist Bank, Data and Infrastructure Operations. Over the last 3 years as Head of Enterprise Data Infrastructure and Operations at SunTrust, now Truist, Kevin has led a team merging two large banking IT operations teams in two datacenters into one high-performing EDO Operations Squad. Kevin has maintained stellar Platform Availability of 99.99%+ through consistent, intentional, proactive improvements, faster RCA, excellent merger readiness, and YoY improvements via SRE transformation. Kevin’s current goals in 2022-2023 for Truist EDO include building an Enterprise Data Guild model and building a state-of-the-art DR datacenter in Texas. Kevin has maintained a 90%+ teammate engagement score while building and driving a strong, accomplished team with high goals.
  • Sunil Joshi, Distinguished Engineer, IBM.  As CTO for North America Cloud Application Development at IBM, Sunil currently drives the entire lifecycle of sales, IT strategy, and solution implementation on Cloud-Native Architecture, DevOps Engineering, and Platform Engineering Services for Fortune 500 clients.

 


Keynote (SRE-210): Jason McGee, General Manager IBM Cloud Platform and Ralph Bateman, Distinguished Engineer for IBM Cloud SRE

Speaker Bios: 

  • Jason is an IBM Fellow and Cloud CTO focused on building strong and enduring technology that benefits clients, partners and IBM.  As General Manager of IBM’s Public Cloud Platform, Jason is responsible for the technology development, business strategy, architecture and successful delivery with Clients of IBM’s multi-billion-dollar Cloud franchise.
  • Ralph is the Distinguished Engineer for IBM Cloud SRE. He pioneered SRE for IBM Cloud and has established many SRE teams across IBM. Under his technical leadership the SRE team dramatically improved the availability and security of IBM’s public cloud services. He is a transformational technologist who has run and supported large production environments throughout his entire career, both inside and outside of IBM.

 


Keynote (SRE-220): Rishi Vaish, CTO and VP, IBM Sustainability Software

Speaker Bio: Rishi Vaish, CTO and VP, IBM Sustainability Software
Rishi Vaish is a transformational product and engineering executive with a proven record of leading technology, product strategy, product development, product management, design and cloud operations for enterprise products to meet aggressive timeline, revenue and market-share goals. His breadth of experience spans Artificial Intelligence (AI), Machine Learning (ML), Data Science, Hybrid Cloud technologies, Cloud Computing, Infrastructure-as-a-Service (IaaS), Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), Application Integration and Middleware industries and technologies. He has operated successfully at a variety of scales, from startups to multi-billion-dollar product portfolios with globally distributed teams.

 


Keynote (SRE-240): Ingo Averdunk, Distinguished Engineer, Service Management and SRE, WW SRE Profession Co-Leader

Description: Ingo will reflect on the evolution of, and recent trends in, the area of reliability. Starting from traditional Service Management approaches, the industry has moved towards SRE as an emerging practice to meet the increasing demands of reliable, scalable services. His session will highlight the key differences and predict how SRE may evolve even further.

Keywords: SRE, Evolution, Trends

Speaker Bio: Ingo Averdunk, Distinguished Engineer, IBM
Ingo Averdunk is a Distinguished Engineer in the IBM Cloud Expert Labs. As part of the worldwide team, he is responsible for Architecture and Solutions for Cloud Service Management and Site Reliability Engineering (SRE). Mr. Averdunk develops architectures and consults with IBM’s strategic customers, leads Cloud Adoption and Transformation initiatives, and performs RedTeam reviews globally. Ingo Averdunk is a member of the IBM Academy of Technology (AoT) and the global profession co-lead for SRE in IBM. He co-authored “The Cloud Adoption Playbook” (Wiley, 2018), documenting proven strategies for transforming an organization with the Cloud. Ingo is the Meetup Organizer for the SRE Meetup Munich.

 


Keynote (SRE-250): John Allspaw, Adaptive Capacity Lab and David Leigh, IBM

Speaker Bios:

  • John Allspaw, Adaptive Capacity Lab. John is an engineering leader and researcher with over 20 years of experience in building and leading teams engaged in software and systems engineering. He has spent the last decade bridging insights from Human Factors, Cognitive Systems Engineering, and Resilience Engineering to the domain of software engineering and operations. Publications include the books “The Art of Capacity Planning” (2009) and “Web Operations” (2010) as well as the foreword to “The DevOps Handbook.” Apparently, his 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement.
  • David Leigh, IBM.  David is a Distinguished Engineer in IBM’s CIO organization where he focuses on Resilience Engineering and helping teams to improve their ability to cope with operational surprises in the complex, mission-critical systems for which they are responsible.

 


Keynote (SRE-260): Varun Bijlani, Global Managing Partner – Hybrid Cloud Transformation Services, IBM Consulting

Speaker Bio: Varun Bijlani is Global Managing Partner – Hybrid Cloud Transformation Services – IBM Consulting. He is a business and transformation program leader with strong team leadership experience, business development, practice management, and programme management skills. He is experienced in international and emerging-market business environments and has lived and worked in the UK, India, Western and Central Europe, the US, UAE and North Africa.

 


Keynote (SRE-270): Jerry Cuomo, IBM Fellow, VP and CTO of Technology & Consulting

Speaker Bio: Jerry is an IBM Fellow, VP and CTO of Technology & Consulting. He is recognized as a prolific contributor to IBM’s software business, producing products and technologies that have profoundly impacted how the industry conducts commerce over the world-wide-web while dramatically improving the consumer experience. Jerry has exhibited a repeating pattern of breakthrough innovations in software design, engineering, and business strategy, across IBM’s most financially successful and industry-recognizable software product offerings. He has pioneered emerging technology projects in the areas of Hybrid Cloud, Blockchain, API Economy, Mobile computing, Web Application Servers, Integration Software, Java, and Instant Messaging Software, filing over 100 US patents across these areas. Jerry is the CTO of Applied Hybrid Cloud & AI, which is part of IBM’s multi-billion dollar Services business. His mission is to drive IBM Technology deployments in Services engagements by advocating how and where IBM Technology best solves relevant customer problems with provable economics (time, cost, value). As CTO and IBM Fellow, Jerry is committed to growing the technical community and profession at IBM, including focusing on building a strong and diverse technical team with world-leading skills and ethics in AI and Hybrid Cloud.

 


SRE-001: Boundary security at scale

Presenter: Alexander Mckenzie-Kelly, Daniel Marshall

Description: The Network Intelligence (Netint) squad manages 268 network gateway appliances all around the globe with firewall configuration deploying up to 50 times per day. Our firewalls protect the IKS, ROKS, and Satellite public cloud offerings. With as few as 3 people in the squad we had no choice but to automate the generation and propagation of this config. We will discuss the evolution of our tooling and share our learnings.

Keywords: Automation, Firewall, Security

Speaker Bio:

  • Alex Mckenzie-Kelly is the Netint squad lead. He has worked at IBM since joining as a graduate in 2013 across a variety of roles including performance test, development and SRE.
  • Daniel Marshall is a senior member of the Netint squad. He has worked at IBM since joining as a graduate in 2019. He has never known the pain of a role outside Netint.

 


SRE-004: Learning from Customer Impacting Events

Presenter: Justin-Kristopher Laws

Description: This talk centres on creating a positive review of customer-impacting events. As SRE leaders, it’s our role to ensure SLAs are met for the service we provide. When service disruptions arise, the SRE team’s primary role is to ensure emergency response processes are followed to the letter to quickly restore the service. The value for any SRE team lies not in how quickly it responds and restores service; rather, the postmortem review session, when done properly, can yield the most value to the team and to the service. I’d like to share some lessons learned in the hope that someone in the audience will find ISV’s outage experiences helpful.

Keywords: Learning from Incidents


SRE-007: Monitoring Amazon EKS with Instana

Presenter: Andrew Zhang

Description: Instana is a fully automated APM solution for cloud-native applications. With a single agent, we auto-discover application building blocks, trace every request, and create a dynamic graph of all dependencies. With a tight integration with EKS, Fargate and Lambda, Instana saves DevOps teams headaches and accelerates software development.

Keywords: DevOps, Cloud, Monitoring


SRE-009: 4000 years of Observability – The Universe is the Largest Production Environment in Existence

Presenter: Robert Barron

Description: While “Observability” is considered one of the newest and shiniest items in the SRE toolbox, humans have been observing the skies for thousands of years and, ever since the development of astronomical observability tools such as telescopes and spectroscopes, have been adopting the concept of “understanding the universe by observing the stars” or, as an SRE would say “the practice of assessing a system’s internal state by observing its external outputs.”
While standing far away from the target of their observations, astronomers can understand their inner workings and predict precisely how they will change in the near and far future. This is a level of observability SREs dream of having!

Of course, there are differences between astronomers and SREs – for example, SREs initiate and plan Chaos experiments to test their environments while astronomers merely aim their telescope in a new direction and see an experiment in action – but in the end, they are both using scientific and engineering tools to understand the innermost workings of the environments they are responsible for.

Astronomers have simply been doing it for orders of magnitude longer than SREs!

Keywords: Observability, Learning from Incidents


SRE-011: How can the IBM Cloud Pak for Watson AIOps help my IT operations team?

Presenter: Ricardo Olivieri

Description: In this session, we will provide a quick introduction to AIOps and to the IBM Cloud Pak for Watson AIOps solution. We will explain that the purpose of IBM Cloud Pak for Watson AIOps is not to replace existing monitoring tools in your environment, but instead to consume data from those tools, and by consolidating and tracking multiple sets of data from different resources, to empower your teams to react faster with more knowledge and deeper understanding. We will also do a quick demo to showcase some of the Cloud Pak for Watson AIOps capabilities such as log anomaly detection, grouping of events, and topology, among others.

Keywords: AIOps, IBM Cloud Pak for Watson AIOps, IT, IT operations, SRE

Speaker Bio:

  • Ricardo Olivieri is a Senior Technical Staff Member and IBM Cloud Pak for Watson AIOps Solutions Architect with the AIOps Elite Team. His areas of expertise include gathering and analyzing business requirements, architecting, designing, and developing software applications. Ricardo has extensive experience in the complete software development cycle and related processes, especially in Agile methodologies. Ricardo is now mainly focused on helping customers adopt the IBM Cloud Pak for Watson AIOps so that they can prevent critical issues before they occur and resolve them quickly when they do occur. His background in application development and DevOps technologies makes him perfectly suited to assist clients in their AIOps journeys.

SRE-013: SRE Manager Panel

Presenter: Kaniz Sivjee

Description: Join our host Kaniz and a panel of IBM and Red Hat SRE managers as they answer questions on becoming an SRE manager, what it entails, the qualities you need and the challenges you will be up against.

Speaker bios:

Kaniz Sivjee (panel host)
Location: IBM Canada, Markham, Ontario, Canada
Division: IBM Data & AI
Current role: Planning Analytics – Site Reliability Engineering Sr. Manager
Prior experiences: Quality Assurance Manager, DevOps Manager, Product Manager

Thomas Fiege:
Location: IBM San Jose, CA
Division: IBM Systems
Current role: Hyper Protect Services – Site Reliability Engineering Manager
Prior experience: High End Storage Systems Development; Team Lead; Release Lead; Management

Candace Sheremeta:
Location: Cary, NC
Division: OpenShift SRE @ Red Hat
Current role: Associate Manager, Site Reliability Engineering
Prior experience: In a previous life, I was a high school English teacher. After going back for a CS degree and landing at Red Hat, I worked on both Satellite and OpenShift before moving into SRE management.

Jayshree M:
Location: Bangalore, India
Division: Hyper Protect Services
Current role: Manager, Site Reliability Engineering
Prior experience: Java & Web developer, DevOps Engineer, Design Thinking evangelist, Security & Compliance Manager.

Keywords: SRE Management, Discussion


SRE-015: The IBM CIO Learning from Incidents Program: Panel Discussion

Presenters: David Leigh, Sachin Avasthi, Jenna Lavigne, Suj Perepa, Josh Whetton

Description: The CIO Learning from Incidents (LFI) Monthly Meeting brings together CIO Executives, STSMs, and others to discuss a small number of incidents in depth. This monthly meeting is an important aspect of the larger program to scale our ability to learn from incidents. An explicit goal of the CIO Learning from Incidents meeting is to inspire team-based incident learning practices outside of this monthly event. So what have those who have participated in our new Learning from Incidents practice learned? How have they been inspired? This 30-minute panel discussion will cover a wide range of topics, including what it was like to have one of their surprises be the topic of large group discussion, what participants have learned about their systems through their participation, and what changes they have observed in how teams now approach incidents as a result of this experience.

Keywords: Learning from Incidents, Panel Discussion

Speaker Bio:

  • David Leigh is a Distinguished Engineer in IBM’s CIO organization where he focuses on Resilience Engineering and helping teams to improve their ability to cope with operational surprises in the complex, mission-critical systems for which they are responsible. During his career, David has been a software developer, architect, SRE, DevOps evangelist, and an engineering manager. It was during his days as a DevOps evangelist nearly 10 years ago that he was first exposed to the ideas behind the modern approach to learning from incidents. Since 2015, David has represented IBM in the SNAFUcatchers consortium for resilient business-facing IT, participated in the workshops which produced the 2017 Stella Report on Coping with Complexity, and presented at the Resilience Engineering Association Symposium in 2017 and 2019.
  • Sachin Avasthi is a Senior Technical Staff Member in IBM’s CIO organization where he focuses on Application Modernization and helping teams assess, identify, and implement a plan to modernize their application. He also works with IBM Research to co-create tools that can help accelerate the modernization journey and scale the use of these tools. During his career, Sachin has been a software developer, a DevOps evangelist, and an architect. He has a passion for researching emerging technologies and implementing them in real-world use cases in an enterprise application landscape. It was as part of these implementations that he recognized the need to use incidents as a way of learning rather than as a fault-finding exercise. Outside of the day-to-day work, he spends a lot of his time mentoring developers and architects on career and professional development.
  • Jenna Lavigne leads the CIO Quote-to-Cash DSW Sub-Domain with ownership of Demand Management, Discovery, Agile Project Management, Testing including Automated Testing, Deployments, End User Support and Operations. She has over 20 years of experience with IBM, and a passion for client success and agile ways of working. She leads a highly effective, geographically dispersed team responsible for many aspects of IBM’s quote, ordering, sales, and fulfilment systems for Software, SaaS and Appliance lines of business.
  • Rosalind Radcliffe – IBM Fellow, CIO DevSecOps CTO, AoT DevOps and SRE Co-Team Lead
  • Josh Whetton is a Site Reliability Engineer in CIO’s Developer Experience tasked with delivering GitHub Enterprise at scale for IBM’s developers. Throughout his career, Josh transitioned from emergency medicine to computer engineering via military, DOD, Homeland Security, and finally IBM. It was when Josh joined Project Whitewater that he was introduced to new techniques for learning from incidents. This further led to working with professionals in the field and a desire to help others learn from their incidents.

 


SRE-016: How to get started as a post-incident review facilitator: Demystifying the practice for newbies by sharing personal experiences and practical how-tos

Presenter: Randall (Randy) Horwitz

Description: It’s becoming well known that “5-whys” is an ineffective approach to post-incident learning. There is never “one reason” a complex system fails.

Fortunately, there are new and better methods for learning from incidents. Enabling those who were part of an incident to tell their story and enabling your organization to learn is key.
There are many videos, papers, and books out there but it can be daunting to know where to start, and a new facilitator does not have weeks to read everything. So where do they start? How do they keep learning as they go?

Investigating a major incident takes time and care. Typically, a timeline has to be written, interviews may need to be conducted, a draft report needs to be written and shared with those who participated in the incident, and then a final learning session has to be held.

This 30-minute talk will share key resources and actions you can take to come up to speed as a new facilitator or share tips for those who are currently facilitating.

Join Randy Horwitz, post-incident review facilitator, IBM CIO, to learn:

– How to use Howie: The Post-Incident Guide as a framework for incident investigations.
– What videos and articles are key to getting started in days.
– Dos and don’ts on creating timelines including key pieces to look for
– Dos and don’ts on conducting interviews
– Dos and don’ts on facilitating learning sessions
– Suggested further reading items.

Keywords: #LFI, #LEARNINGFROMINCIDENTS, #HOWTO, #GETTINGSTARTED

Presenter Bio: 

  • Randall (Randy) Horwitz currently works as an incident facilitator on the CIO Resilience Engineering team led by David Leigh. Since 2016, when he worked as the support manager for the IBM Developer Experience, Randy has been passionate about development teams being able to respond to and learn from their incidents. For example, in 2017 he was instrumental in the Virtual Private Cloud UI team being the first in its organization to have a documented follow-the-sun incident response process. Recently, Randy learned of the CIO Learning from Incidents program and was so excited he decided to pivot his career into the space. He graduated with a Bachelor of Science in Computer Science from the Rochester Institute of Technology in 1999 and has been with IBM ever since. One of his proudest accomplishments remains being a totally blind UI developer on the WebSphere Admin Console team, where he drove line items to make it 100% accessible to those with disabilities.

SRE-019: What is the modern approach to learning from incidents and how is it different from Root Cause Analysis

Presenter Name: David Leigh

Theme: Learning from Incidents

Description: I predict that a system for which you are responsible is going to fail – because all our systems are complex and all complex systems fail.  And I predict that in order to prevent that failure from happening again, you’re going to introduce some new automation that everyone agrees will make your system better.  And then some time later – maybe soon or maybe quite a while later – that automation that you added will contribute to another failure that is every bit as bad – or worse – than the failure it was designed to prevent.

As engineers and responsible professionals, we are compelled to improve our systems in order to avoid problems that we have experienced in the past. However we often overlook how those improvements also increase the system’s complexity, making it harder for us to deal with future surprises.

One of the best ways to avoid this pitfall is to understand the “system” as a combination of the technical components and the humans who are keeping it running through activities that include observing, inferring, anticipating, planning, troubleshooting, diagnosing, modifying, and correcting. By looking more closely at the experience of practitioners during surprise events (aka “incidents”) we can begin to understand how those activities are accomplished and how to make them easier. And we can design system improvements in a way that does not make those activities harder.

Speaker Bio: David is a Distinguished Engineer in IBM’s CIO organization where he focuses on Resilience Engineering and helping teams to improve their ability to cope with operational surprises in the complex, mission-critical systems for which they are responsible. During his career, David has been a software developer, architect, SRE, DevOps evangelist, and an engineering manager. It was during his days as a DevOps evangelist nearly 10 years ago that he was first exposed to the ideas behind the modern approach to learning from incidents. Since 2015, David has represented IBM in the SNAFUcatchers consortium for resilient business-facing IT, participated in the workshops which produced the 2017 Stella Report on Coping with Complexity, and presented at the Resilience Engineering Association Symposium in 2017 and 2019.

 


SRE-020: SRE 101. 50 SRE concepts in 5 minutes

Presenter: Yuvini Velasquez

Description: Learn 50 SRE concepts with a quick breakdown of the most common jargon used by SREs to better understand this field.

Keywords: SRE Concepts


SRE-023: Using GreenOps for sustainable computing – and saving some money!

Presenter Name: Bas Pluim, Bart Schrooten, Katarzyna Soltyk

Theme: Responsible Computing and Sustainability

Description: If you’re using public cloud, you’re probably already practicing some sort of FinOps to help manage your costs. GreenOps can take this to the next level, augmenting your decisions on where to deploy your workload based on the carbon footprint. Whether you’re doing this to attract new customers, motivate your employees, or comply with government regulation, GreenOps is an essential part of any sustainability program. This session will provide you with an overview of GreenOps, dig deeper into how to minimize the environmental impact of running a workload in the cloud, and show how the Nordcloud toolset can automate this work.

Keywords: GreenOps, FinOps, sustainability

Speaker Bio:

  • Bas is the DevSecOps and IT Automation Practice Lead for the Americas and is responsible for making sure IBM brings the smartest, best trained engineers to our clients to help them build and manage their cloud applications. He has over 20 years of experience delivering software, leading teams, and driving complex client transformations.
  • Bart is the go-to-market lead for the IBM Multicloud software suite in the Americas and Northern Europe. He has over 20 years of experience in helping clients implement smart solutions for information and cloud management. IBM Multicloud is an innovative suite of products enabling FinOps and DevOps practices in simplifying the IBM client’s cloud journey, optimizing cloud costs and supporting sustainability.
  • Katarzyna is a Nordcloud Product Manager in charge of the Klarity cloud management tools. She creates the product vision, strategy, and roadmap, and aligns stakeholders around it. She collaborates closely with engineering, sales, marketing, and support to ensure that business case and customer satisfaction objectives are met.

 


SRE-026: Supporting the Post Mortem Process (Problem Mgmt) with Enhanced Monitoring

Presenter Name: O.J. Dua

Description: A mature Problem Management discipline seeks to identify the root cause of an outage to mitigate the risks of recurrence. In a blameless postmortem, key stakeholders in an incident review what happened, why it happened, and how to prevent it from happening again in a safe, judgement-free environment. Determining which alerts triggered or did not trigger should be a key component of this workflow.

Keywords: #PostMortem #ProblemManagement #ITIL #Monitoring #Observability #IncidentManagement #Incidents #Logs #Instana #CP4WAIOPS #AnomalyDetection #Watson

Speaker Bio: O.J. Dua is a Senior Technology Engineer in IBM Client Engineering. He has over 25 years of experience in IT Operations working for a large, US-based financial services company, where he held roles ranging from end-user desktop support to virtual server engineering to global incident management in the firm’s Network Operations Center (NOC). His most recent role prior to joining IBM was Director for all monitoring, observability, and event management solutions. This role saw him participating in Root Cause investigations and post-mortem discussions.


SRE-033: Incident Remediation using Context-Driven Resolution Recommendation for Persistent Log Anomalies

Presenter Name: Rohan Arora, Harshit Kumar, Ruchi Mahindru, Prateeti Mohapatra, Seema Nagar

Theme: Learning from Incidents

Keywords:  #persistence, #domain-awareness, #golden-signal, #incident-prioritization, #resolution-recommendation, #out-of-the-box-models, #context-driven-incident-remediation #aiops

Description: While several advances have been made in the use of AI-based tools for IT Operations, SREs are still overwhelmed by the number of incident records and constituent alerts that are created. The objective of this presentation is two-fold: 1) reducing noise and alert fatigue by identifying alerts that persist and affect end users, and 2) resolution recommendation, i.e., pre-populating the incident with the recommended resolution for faster and more effective incident remediation.

Speaker Bio: 

  • Rohan Arora is a Senior Software Engineer at the IBM T.J. Watson Research Center in Yorktown Heights, New York. He is currently focused on applying machine learning to the areas of IT operations and sustainable computing. In his past role at IBM, he built and deployed cloud-native applications to facilitate augmented and mixed reality applications leveraging Django, Kubernetes, agile development principles, and CI/CD pipelines. He received a master’s degree in Electrical and Computer Engineering (ECE) from the University of Illinois at Urbana-Champaign in 2016.
  • Harshit Kumar is a Senior Technical Staff Member at the IBM India Research Labs. His primary research interests are in the area of machine learning, deep learning and its applications in conversation-AI and IT Operations. He has filed over 20 patents and published more than 25 research papers in data mining and AI conferences such as COLING, EMNLP, AAAI, ICSE, ESEC/FSE, ASE, IEEE CLOUD. Harshit has earned several IBM Research Accomplishment (RA) and Outstanding Technical Achievement Awards (OTAA) for his contributions to the Watson Assistant and Watson AIOps. As a technical lead for Watson AIOps, he is responsible for leading the team to execute the business impact in areas relevant to AIOps, APM, and ARM. His current research is focused on improving the Operations management in IT systems. He received a Ph.D. in Computer Science and Engineering from the Seoul National University in 2014.
  • Ruchi Mahindru is a Senior Technical Staff Member at IBM T. J. Watson Research. For over a decade, she has been involved in several key strategic missions critical for IBM’s business in the area of IBM Cloud Managed Services, IBM Consulting, and AI for IT Operations. She is skilled in building scalable and reusable distributed AI-driven solutions for effective problem remediation to improve provider productivity and customer experience. She is currently leading the global agenda for building Out-of-the-Box models and Precognition for AIOps toward predictive and proactive problem detection and remediation. Her research interests are in context-aware knowledge extraction, semantic data analytics, and inferencing. She received her M.S. degree in Computer Science from the City University of New York in 2005.
  • Prateeti Mohapatra is a Research Manager at IBM Research Lab India, working with a team of researchers in the areas of AI in Operations, Observability and Cognitive Support. Her current focus is on bringing AI solutions to IT operations thereby providing more insights with reactive, predictive, and proactive capabilities across the application life cycle. She has received numerous awards including Research division awards and Outstanding technical achievement awards in the areas of Cognitive Support and AIOps. Prior to that, she was employed with the industrial software systems group at ABB Corporate Research (Bangalore) and the Flash Center/Knowledge Lab group at Uchicago/Argonne National Lab (Chicago). Prateeti’s primary research interests are in the areas of AI, Natural language processing and Statistics. Prateeti is also an IEEE senior member and has published over 25 research papers at conferences such as COLING, AAAI, IEEE Cloud and Interspeech.
  • Seema Nagar is a staff research scientist at IBM Research India. She has been researching in the field of computer science for the last fifteen years. She has over 45 publications in eminent conferences and journals and more than 150 patents filed. She has been named a master inventor for two consecutive terms. She has also been part of the reviewer committee of many eminent conferences, such as IJCAI, NAACL, EMNLP, IEEE Cloud and ACL. She has earned several Research Awards in IBM Research India, including three Outstanding Technical Achievement Awards for her work on social network analysis and trustworthy AI. She actively mentors several researchers/students at the research lab. She obtained her B. Tech. and M. Tech. in Computer Science and Engineering from IIT, Guwahati in 2007, and IIT, Delhi in 2011 respectively. Currently, she is also pursuing a part-time PhD from IIIT, Guwahati.

 


SRE-038: We can build self-driving cars, but what about self-driving operations?

Presenter Name:  Isabell Sippli, Kristian Stewart

Theme: Reliability in Hybrid Cloud and AI

Description: There seems to be a lot of attention on autonomous systems these days, with an extra focus on autonomous driving.
Just recently, a major automotive manufacturer announced a Level-3 approval for their automated driving.

So this is self-driving cars, right?
And if you work in operations, have you ever wondered why we cannot build self-driving operations?
In this talk, I’ll explore what it actually means to have a self-driving car, and whether any comparable principles apply to operations. We’ll also discuss if and when we can reach the same level of autonomy for operations. We’ll close by touching on recent trends in AIOps.

Keywords: autonomous, aiops, operations

Speaker Bio:

  • Isabell Sippli is a Senior Technical Staff Member for AIOps at IBM. She leads an international development team, and drives innovations in the field of AI for operations management. Isabell works closely with clients and partners, and is a sought-after speaker on AIOps and operational challenges.
  • Kristian Stewart is an IBM Distinguished Engineer currently working on solutions for the management of virtual network assets for Telecommunications and Enterprise. He has more than 20 years of experience in IT and Network Operations and has pioneered the development and productization of IBM AIOps capabilities to meet new challenges in these areas.

SRE-042: The Enterprise SRE mindset needs SLOs and OKRs

Presenter Name: Ingo Averdunk

Theme: Reliability in Hybrid Cloud and AI

Description: Every SRE knows about the importance of SLOs and SLIs. But what about OKRs? Objectives and Key Results (OKRs) play an important role in helping clients transform operations and adopt SRE principles. OKR is a goal-setting framework for defining and tracking objectives and their outcomes. An OKR comprises an objective (a clearly defined goal) and 3–5 key results (specific measures used to track the achievement of that goal). The goal of OKRs is to define how to achieve objectives through concrete, specific and measurable actions. Let’s find out how OKRs can help you define your SRE adoption plan and tell whether you are moving the needle in the right direction.


SRE-046: FMEA as a tool to improve MTTD and MTTR KPIs

Presenter Name: Aparna Srinivasan,  Manjunath Kallannavar

Theme: Learning from Incidents

Description: Demand for shorter recovery times keeps growing, especially for a cloud storage service that forms the backbone of the other services and components built on it. Meeting that demand requires a proper and effective analysis of the ‘failure points’ and of the recovery actions needed when a particular failure happens. A “Failure Mode and Effects Analysis” (FMEA) should therefore be carried out, as a background process, on each and every component and service in the IBM Cloud Storage Platform. With this analysis it is easier to brainstorm potential failure modes, their causes and effects, and to rank each one by occurrence, severity, and the detection capability built to identify the failure as soon as possible.

In this presentation, you can expect to learn the method we use to shorten the mean time to recover or repair, with a few use cases that show how effective the process is at shrinking recovery time on the IBM Cloud Storage Platform. Each use case briefly covers a workflow in which a failure occurs, how to identify the failure points in that workflow, and how to use ‘Failure Mode and Effects Analysis’ to recover the system quickly. This method helps meet the Service Level Agreements (SLAs) provided to customers by achieving the defined Service Level Objectives (SLOs) and keeping the Service Level Indicators (SLIs) within those SLOs.
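
The ranking step described above can be sketched in a few lines. Conventional FMEA multiplies the severity, occurrence and detection ratings (each commonly on a 1 to 10 scale) into a Risk Priority Number (RPN) and addresses the highest-scoring failure modes first. The sketch below assumes that conventional scoring; the failure modes and ratings are invented for illustration and are not taken from the IBM Cloud Storage Platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always detected) .. 10 (effectively undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be in the range 1..10")
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes for a volume-provisioning workflow.
modes = [
    FailureMode("metadata store unreachable", severity=9, occurrence=3, detection=4),
    FailureMode("volume attach times out", severity=6, occurrence=5, detection=2),
    FailureMode("silent replica drift", severity=8, occurrence=2, detection=9),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn():4d}  {mode.name}")
```

In this made-up data the hard-to-detect failure mode outranks the more severe one, which is the kind of signal that points at detection (MTTD) work rather than recovery work.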

Keywords: #FMEA #bookofWork #error

Speaker Bios:

  • Aparna Srinivasan is working as a Service Reliability Engineer for IBM India Software Labs. She has over 10 years of experience. She is a Certified Scrum Master holding four patents. She has written many technical articles and blogs.
  • Manjunath Kallannavar has 5+ years of experience in the IT industry. He worked for over 4 years in the mobile domain and has been exploring the SRE role for the past year. He currently works as an SRE for software-defined storage.

 


SRE-048: The day of 10,000 alerts. A SRE’s worst nightmare.

Presenter: Tony Garcia

Description: This talk will discuss preparedness and the state of mind for IKS SREs. We have thousands of machines and tens of thousands of clusters across the world. How do we handle the stress? How do we handle a catastrophic situation?

Keywords: Alerting, Cloud

 


SRE-050: Learn More, Earn More: Certify to Validate Your SRE Skills

Presenter Name: Marissa Moore

Theme: Learning from Incidents

Description: This presentation will focus on ICCT and our robust suite of certifications, with a specific focus on the SRE Learning Paths, both Associate and Professional. It will also discuss the value of getting SRE certified and the organizational benefits of working with a fully certified team.

Keywords: #IBMcloud #certify #ibmcloudcertify #icct #cloudtraining #ibmcloud #cloud #SRE #sitereliabilityengineer

Speaker Bio: Marissa Moore is a curriculum manager with the IBM Center for Cloud Training and is responsible for managing the development of the SRE Curricula. She has over 12 years of experience in training development and delivery and change management, and is currently working towards her own professional certifications.


SRE-054: Modernizing Service Management

Presenter Name: Kevin S Green

Theme: Reliability in Hybrid Cloud and AI

Description: As the Cloud Service Management and Operations (CSMO) team within IBM Public Cloud eXpert Labs (CXL), we have been helping clients over several years to modernize their service management practices. We have honed several deliverables that have proven to be key tools in our client engagements. This session will describe our approach to modernizing operations and showcase some prominent client engagements.
As the base for our engagements, we have adapted the IBM Garage Method by focusing on service management processes and SRE. In interactive Enterprise Design Thinking sessions we review current operations and identify areas for improvement. From there we create operational Minimum Viable Products (MVPs) to help kick off the client’s service management transformation. Using a co-create model and short-term sprints, we provide our hands-on expertise to clients, enabling them to become self-sufficient.
Supporting the transformation are a variety of maturity assessments (skill, adoption), toolchains, as well as a Vision & OKR workshop.
We have also aligned these assets for IBM Public Cloud, to help clients not only onboard their workloads to IBM Public Cloud faster but also establish appropriate service management practices that ensure velocity and quality.

Keywords: SRE, Modern Service Management

Speaker Bio:

  • Shili Yang is a senior managing consultant on the IBM Cloud Expert Labs team. Shili has over 15 years’ experience helping enterprise clients design, build and operate mission-critical IT solutions with IBM products and hybrid cloud platform. Most recently she’s focused on cloud Service Management and Site Reliability Engineering (SRE) projects and worked with clients to modernize IT operations as part of their digital transformation journey. Before joining the services team, Shili had been part of the software development teams responsible for the Db2 and WebSphere Business Integration products.
  • Kevin Green is an IBM Certified Thought Leader and part of the IBM Public Cloud Expert Labs team. He has held IBM leadership roles in Sales, Product Development, Product Management, Product Support and Services. He is a leader and founding member of the Cloud Service Management and Operations (CSMO) team, which leads the modernization of client service management activities including Site Reliability Engineering (SRE), Build to Manage and AIOps. Kevin is helping multiple clients establish their SRE practice.

 


SRE-055: An iterative and repeatable approach to get started with SLI/SLO

Presenter Name: Shili Yang, Kevin Green

Theme: Reliability in Hybrid Cloud and AI

Description: The use of SLI/SLO and SLA to ensure the reliability of IT services is one of the core SRE practices. This talk gives a quick review of an iterative and repeatable approach we’ve developed to help clients adopt the SRE approach and get started with SLI/SLO for the IT services within an organization.

Keywords: SLO, SLI, SLA

Speaker Bio: 

  • Shili Yang is a senior managing consultant on the IBM Cloud Expert Labs team. Shili has over 15 years’ experience helping enterprise clients design, build and operate mission critical IT solutions with IBM products and hybrid cloud platform. Most recently she’s focused on cloud Service Management and Site Reliability Engineering (SRE) projects and worked with clients to modernize IT operations as part of their digital transformation journey. Before joining the services team, Shili had been part of the software development teams responsible for the Db2 and WebSphere Business Integration products.
  • Kevin Green is an IBM Certified Thought Leader and a founding member of the Cloud Service Management and Operations (CSMO) practice. He is part of the IBM Public Cloud Expert Labs team and has held IBM leadership roles in Sales, Product Development, Product Management, Product Support and Services. As a leader in the CSMO organization, he leads the modernization of client service management activities including Site Reliability Engineering (SRE), Build to Manage and AIOps. He has led the establishment of an SRE practice with multiple clients.

SRE-059: Etcd variable selection for cluster health monitoring

Presenter Name: Sujith K Varghese

Theme: Responsible Computing and Sustainability

Description: This session focuses on the Etcd variables used to monitor the health of an IBM Cloud cluster. A cluster’s health parameters can be tracked through a list of Etcd variables that covers aspects such as quorum limit, memory usage, gRPC requests and low quota space. These variables are stored by a Prometheus server in the cluster. Sysdig helps scale this and retrieves the data from the Prometheus server. The data can then be displayed either on a Grafana dashboard or in Sysdig’s monitoring tool.
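
To make two of the variables above concrete, here is a minimal sketch of how a quorum check and a low-quota-space check might be evaluated once the values have been read back from the monitoring stack. The `EtcdSnapshot` fields, the 80% threshold and the sample numbers are assumptions made for illustration; they are not the session’s actual variable list. The only fixed rule is that a cluster of n members needs floor(n/2) + 1 healthy members for quorum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EtcdSnapshot:
    """Hypothetical point-in-time values, e.g. read back from Prometheus."""
    total_members: int
    healthy_members: int
    db_size_bytes: int
    quota_bytes: int


def quorum_size(total_members: int) -> int:
    """Smallest majority of the cluster: floor(n / 2) + 1."""
    return total_members // 2 + 1


def evaluate(snapshot: EtcdSnapshot, low_space_ratio: float = 0.8) -> list:
    """Return human-readable findings; an empty list means no findings."""
    findings = []
    needed = quorum_size(snapshot.total_members)
    if snapshot.healthy_members < needed:
        findings.append(
            f"quorum lost: {snapshot.healthy_members} healthy, {needed} needed"
        )
    elif snapshot.healthy_members == needed:
        findings.append("quorum at limit: one more member failure loses quorum")
    # The 80% default is an assumed alerting threshold, not an etcd setting.
    if snapshot.db_size_bytes >= low_space_ratio * snapshot.quota_bytes:
        findings.append("quota low space: database size is close to the quota")
    return findings


# Hypothetical three-member cluster with one member down and a nearly full database.
sample = EtcdSnapshot(
    total_members=3,
    healthy_members=2,
    db_size_bytes=1_700_000_000,
    quota_bytes=2_000_000_000,
)
for finding in evaluate(sample):
    print(finding)
```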

 


SRE-064: Smart allocation strategies for IBM Cloud Transient Virtual Servers

Presenter Name: Satwik Bhimavarapu, Aditi Singh, Vinay Aggarwal

Theme: Responsible Computing and Sustainability

Description: The IBM Cloud Virtual Servers transient offering is a good option for flexible workloads where cost savings matter. Transient instances are provisioned when unused capacity is available. Consequently, when data center resources are needed for full, on-demand accounts, those resources can be taken back. Transient instances are de-provisioned on a first-on, first-off basis when resources need to be reclaimed. These instances offer flexibility, global availability, and cost savings. They are ideal for non-production workloads such as stress testing of applications. An instance can be configured to receive a notification 2 minutes before it is actually terminated. To help users, we propose a proactive and a reactive approach.

The proactive approach: As the cloud provider, we have the history of all transient servers that have been provisioned, including how long they ran. Combining this data with Machine Learning & Artificial Intelligence approaches, we can tell a user, with some confidence, how long an instance is likely to be available before it is pre-empted.

The reactive approach: An “auto-balance recommendation” is a signal that notifies you when a Transient instance is at elevated risk of interruption. The signal can arrive sooner than the two-minute interruption notice, allowing one to manage the instance proactively. The user can decide to rebalance the workload to new or existing instances that are not at an elevated risk of interruption. It is not always possible for IBM Cloud to send the auto-balance recommendation signal before the two-minute interruption notice; in such cases, the signal can arrive along with the notice. The auto-balance recommendation can be made available as a Monitoring event, and events can be emitted on a best-effort basis.
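
As a rough illustration of the proactive idea, the simplest possible estimate is an empirical one: the fraction of past transient instances that ran at least as long as the horizon the user cares about. This is a plain frequency count standing in for the machine-learning model the talk proposes, and the lifetimes below are invented; a real estimate would also need to account for instances that users deleted themselves or that are still running.

```python
def survival_probability(lifetimes_minutes, horizon_minutes):
    """Fraction of past instances that ran for at least `horizon_minutes`."""
    if not lifetimes_minutes:
        raise ValueError("need at least one historical lifetime")
    survived = sum(1 for t in lifetimes_minutes if t >= horizon_minutes)
    return survived / len(lifetimes_minutes)


# Hypothetical lifetimes (minutes) of reclaimed transient instances.
history = [30, 45, 60, 90, 120, 180, 240, 300, 480, 720]

for horizon in (60, 240, 600):
    p = survival_probability(history, horizon)
    print(f"P(instance runs >= {horizon} min) ~ {p:.0%}")
```

With this sample history, 8 of 10 instances lasted at least an hour and only 1 of 10 lasted ten hours, so a user planning a short stress test would see a much higher confidence than one planning an overnight job.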

Speaker Bio:

  • Satwik Bhimavarapu: I have been working as part of the IKS Carrier & SRE team in IBM Cloud for the past 2 years. My key contributions have been towards the development of IKS Carrier, and the maintenance of the IKS control plane by creating and contributing to various automations.
  • Aditi Singh: I have been working as part of the IKS SRE team in IBM Cloud for the last 2 years. I’m a team player and passionate about my work. My key contributions have been towards creating various automations and maintaining the reliability of the IKS master control plane.
  • Vinay Aggarwal: I work on the IKS master control plane in an SRE-dev role, with expertise in DevOps, cloud and virtualization. I have hands-on experience with multiple VMware and cloud products (IKS, AWS, OCI, VMware hybrid) and DevOps tools.

SRE-069: Bastion implementation for IBM Cloud Kubernetes Services

Presenter Name: Soumya Shankar Ghosh, Anees Patel

Theme: Responsible Computing and Sustainability

Description: Growing a PaaS business like IBM IKS means bringing on new verticals and selling more to customers. The bigger the customers, the higher the stakes. There is often no bigger customer than the federal government, so many PaaS organizations must demonstrate FedRAMP compliance in order to sell into this lucrative vertical. This paper walks you through the logical controls that you must implement in order to pass a FedRAMP audit and how IBM achieved this using the Bastion and Teleport Access Plane.

The term bastion comes from the fortifications that arose when cannons started dominating the battlefield. Similar to Medieval structures, computer networks need layers of protection against intruders. Bastion hosts, like their physical counterparts, are a part of this defensive perimeter.

Teleport has made obtaining a FedRAMP Moderate authorization that much more achievable: it offers FIPS 140-2 endpoints and easy integration with our SSO and MFA, and its view into the audit logs of remote connection sessions provides the appropriate insight for continuous monitoring.
Teleport’s design goal is to provide sensible choices by default. As a result, Teleport automatically enforces most of the best practices without additional configuration. We at IBM Cloud leveraged these built-in capabilities of Teleport to get our cloud FedRAMP compliant.

This paper discusses the various FedRAMP requirements and how IKS SRE implemented a solution that enabled IBM Cloud IKS as a FedRAMP-compliant managed solution.


SRE-073: CSUTILs – A DevSecOps tool in IBM Cloud to keep the infrastructure compliant

Presenter: Milan Viradia

Description: In this session, we will see how we leverage DevSecOps practices in our cloud infrastructure to keep it compliant. We developed a tool that combines a set of cloud-security and monitoring software, and we use it for log monitoring, health checking, scanning, file monitoring, etc. This tool is being used by various IBM Service teams to keep their clusters compliant. Recently, we published this tool as an add-on so that Service teams can use it easily.

Keywords: Log Monitoring, Compliance, Automation

Speaker Bio:

  • Milan Viradia
    Milan joined IBM India as an IKS SRE in July 2021, starting his IT career with IBM right after completing his M.Tech. in Computer Science. As part of IKS SRE, he contributes to SRE bot development, incident handling, automations, etc. He was a major contributor to the csutil add-on automation.

 


SRE-074: How we dealt with 1000+ alerts/day

Presenter Name: Jeeva Tharmakulasingam

Theme: Learning from Incidents

Description: In this talk, I highlight how our front-line SREs were able to react to 1000+ alerts/day. I focus on the tooling that we use to help keep SREs sane. Facing a large number of alerts led to rapid development of tooling and automation. First, we have a unified operations dashboard that focuses on team communication. It’s used for adding comments to our alerts and ensuring they are preserved in tickets. Some are also persistent comments. Next, we have a notification system for getting in touch with customers, whether about outages or about actions that we’d like customers to perform. We have inventory management to describe all customer instances and nodes. In addition, we have a centralized Change Management view for planned maintenance and customer actions such as deployments, scaling & restores. We created a metric to measure alerting: the total number of instances/alerts over a 24hr period. We identified the heavy hitters and focused on bringing the alert count down by addressing alert duplication and tuning. Next, we built automation tools to suppress the “noise” around planned maintenance, and self-healing to address alerts before they reach operations.

Keywords: SRE, Efficiency, Toil

Speaker Bio: Jeeva is the SRE Architect for Hybrid Data Management SaaS Offerings, whose focus is on ensuring product delivery is operationally scalable. His areas of expertise span key SRE tenets such as Observability, Toil Reduction, & Continuous Improvement.

 


SRE-078: Virtual Machine Provisioning Availability

Presenter Name: Priya Maria Rao

Theme: Learning from Incidents

Description: In IBM Cloud we see a few issues that impact VM provisioning availability. I would like to talk about an issue that was seen very frequently and the steps that Compute Services SRE took to monitor and mitigate it. VMs using the Red Hat OS were failing provisioning intermittently. This was suspected to be due to an underlying network connectivity issue with the capsule server, DNS server or edge server. It was difficult to debug and find the root cause because the affected VMs would get deleted by the customers who had faced the problem. So Compute SRE implemented observability and monitoring dashboards that specifically catch RHEL-related VM provisioning issues by provisioning test VMs and then running ping and health check scripts from within them. We were then able to debug the test VMs to identify infrastructure issues and created runbooks to remediate them. This talk will cover the test VM provisioning aspects and the dashboards created for monitoring and alerting.


SRE-081: Manage multiple landing zones, across Private and (multiple) public clouds

Presenter Name: Deepak Bhayana

Theme: Reliability in Hybrid Cloud and AI

Description:

Across organisations, consumption of services from multiple cloud providers is booming due to business line preferences, avoiding vendor lock-in, mergers and acquisitions, or simply ensuring business continuity through resilience and disaster recovery solutions. Organisations need guidelines about services and setup details, so cloud vendors provide a Landing Zone: a configured environment with a standard set of secured cloud infrastructure, policies, guidelines, and centrally managed services for deploying enterprise workloads. The following points matter when managing multiple clouds or their landing zones:

  1. Using Policies: Everything we do in cloud is software- and code-defined. As a cloud supports multiple customers at one time, any maintenance or patching needs to be done without downtime, so providers make sure that code is safe, systems are patched on time, and mistakes such as deleting a disk are handled well. They address this with policies and management tooling concepts such as IAM, stacks and blueprints.
  2. Orchestrating Policies for multi-cloud: The need is for a single repository to store and manage all policies. Technically, all cloud providers support JSON as a definition format, so the logic can be abstracted from the code itself and policies can be presented with consistent logic.
  3. Demarcation: This is about separation of duties: who is responsible for what in IaaS, PaaS and SaaS computing. We do not want global admin rights all over our estate; we need the minimum necessary access, using the Principle of Least Privilege (PoLP) and defined IAM processes to set the matrix.


SRE-082: A Slackbot for silencing alerts

Presenter Name: Jarosław Wachel

Theme: Reliability in Hybrid Cloud and AI

Description: We developed a maintenance Slackbot to engage when we need to pause alerts on a system. Let’s face it, from time to time, we need to perform an action that may produce alerts on a system, whether that’s due to a customer request during debugging or a test system in the production environment. During such operations, we don’t want to unnecessarily alert our SREs as a) it’s noise; b) we don’t want people stepping over each other. That’s where our Slackbot comes into play: select a system and an end time, and that target system will not surface any alerts to SREs for the duration of the window. This is critical because our automation tool reacts to some alerts. For example, some components have self-healing enabled, so if an alert triggers, the automation will attempt to heal the component and thus conflict with the original intent. Plus, we don’t want noise in our data!
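
The core of such a bot is a lookup of maintenance windows at alert-routing time. The sketch below leaves out the Slack integration entirely and only shows that lookup; the class, the function names and the in-memory storage are hypothetical, not the team’s implementation.

```python
from datetime import datetime, timedelta


class MaintenanceWindows:
    """In-memory registry of silenced systems (a real bot would persist this)."""

    def __init__(self):
        self._until = {}  # system name -> end of the maintenance window

    def silence(self, system, end_time):
        """Record that `system` should stay quiet until `end_time`."""
        self._until[system] = end_time

    def is_silenced(self, system, now):
        end_time = self._until.get(system)
        return end_time is not None and now < end_time


def route_alert(alert, windows, now):
    """Return 'suppress' during a maintenance window, otherwise 'notify'."""
    if windows.is_silenced(alert["system"], now):
        return "suppress"  # neither page SREs nor trigger self-healing
    return "notify"


# Hypothetical usage: silence one system for two hours.
start = datetime(2022, 6, 1, 9, 0)
windows = MaintenanceWindows()
windows.silence("db-17", start + timedelta(hours=2))

alert = {"system": "db-17", "name": "instance_down"}
print(route_alert(alert, windows, start + timedelta(hours=1)))
print(route_alert(alert, windows, start + timedelta(hours=3)))
```

Because the window carries its own end time, alerts resume automatically once it passes, so nobody has to remember to switch them back on.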

 


SRE-083: SREs – Learning from the failures/incidents (Be the Tony Stark of the pack)

Presenter Name: Kaushal Kishore

Theme: Learning from Incidents

Description: You may be wondering why, or who, Tony Stark is. I am a big Avengers fan, and SREs are no less than Avengers. Just as Tony Stark (Iron Man) learns from every failure and mistake he makes while guarding the world and evolves his armor and suit, we need to do the same as SREs.
As SREs, we are expected to fix an incident or failure when it happens in a production system, and yes, we do that. Something an SRE should practice and be proud of is not only putting out the fire but also finding the root cause of the fire and resolving it. There are a few tools that we as SREs can use to strengthen our armour and fight back better next time.

  1. Documentation/Runbooks: Documents and runbooks are the transcripts of your road to success. They help not only you in the future but also the folks working alongside you. Every new issue or failure should be documented well and peer reviewed. (If you didn’t document it, you didn’t do it.)
  2. Instrumentation/Monitoring/Alerting: Every time a failure happens within the system, a chain reaction might have triggered it. Each trigger point must be kept in check. Also, if we want to be a step ahead, a Predictive Anomaly Detection Model can be set up; it can be as simple as scanning through the log files or the time series metrics to alert us before a failure happens.
  3. Knowledge Sharing: Failures must be taken as an opportunity to learn, implement and educate. As with our HA (highly available) setups, single points of failure must be avoided in people as well as in infrastructure.
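
As an illustration of how simple the time series scan mentioned in point 2 can be, the sketch below flags any sample that deviates from the mean of the preceding window by more than a chosen number of standard deviations. The window size, the threshold and the latency values are assumptions for illustration, not a recommendation from the talk.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds.
latencies_ms = [101, 99, 100, 102, 98, 100, 101, 250, 99, 100]
print(find_anomalies(latencies_ms))
```

On this sample the scan flags index 7, the 250 ms spike. One known weakness of such a simple scan is that a spike inflates the standard deviation of the windows that follow it, so a second anomaly shortly after the first may be missed.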

Keywords: #learningfromfailures, #SRE_shield & #SREanAvenger

Speaker Bio: I am Kaushal, working in the Software organisation at IBM. I have a total of 6.5 years of experience, all in the SRE domain. I have been at IBM for close to 17 months, working on Planning Analytics on Cloud as an SRE to maintain, manage and deliver solutions to customers. Reliability Engineering is something I am passionate about, and I try to spread the understanding of this excellent practice and profession.


SRE-089: Managing Monitoring Systems with GitOps and CI/CD Pipelines

Presenter Name: Denis Medeiros

Theme: Reliability in Hybrid Cloud and AI

Description: Monitoring and observability are key items in site reliability engineering, and engineers use different tools to collect and analyze data, logs, metrics, traces, etc. Given the importance of these tools, it’s essential that they are stable and configured correctly. This can be challenging when they are used in dynamic environments, with multiple users and teams making configuration changes. To address this, our SRE team decided to apply GitOps strategies and techniques to manage those tools, and in this presentation I am going to share our journey to get this model implemented and adopted by SREs.

Our goal was to have all monitoring system configuration (alert, event, and notification definitions, dashboards, etc.) defined as code in GitHub (defined as the source of truth), and all configuration operations processed by a CI/CD pipeline. Regular users should have only viewer permissions on the monitoring systems, and any change in the configuration needs to be proposed by opening a pull request in GitHub, which needs to be reviewed and validated by peers. When the pull request is opened or modified, the pipeline runs automatic tests and other validation functions to ensure the code changes meet the required conditions and do not break any configuration rules. Finally, when the pull request is approved and the code merged, the pipeline deploys the configuration changes and logs all operations.
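
A validation step of the kind the pipeline runs on each pull request might look like the sketch below. The required fields, the allowed severities and the alert definitions are all hypothetical; the talk does not say which rules the team actually enforces.

```python
# Hypothetical configuration rules enforced before a change may merge.
REQUIRED_FIELDS = {"name", "expression", "severity", "notify"}
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def validate_alerts(definitions):
    """Return a list of error strings; an empty list means the change passes."""
    errors = []
    seen = set()
    for index, alert in enumerate(definitions):
        label = alert.get("name", f"<entry {index}>")
        missing = REQUIRED_FIELDS - set(alert)
        if missing:
            errors.append(f"{label}: missing fields {sorted(missing)}")
        if "severity" in alert and alert["severity"] not in ALLOWED_SEVERITIES:
            errors.append(f"{label}: unknown severity {alert['severity']!r}")
        if label in seen:
            errors.append(f"{label}: duplicate alert name")
        seen.add(label)
    return errors


# Hypothetical alert definitions proposed in a pull request.
proposed = [
    {
        "name": "HighErrorRate",
        "expression": "error_rate > 0.05",
        "severity": "critical",
        "notify": "sre-oncall",
    },
    {"name": "DiskFilling", "expression": "disk_used_ratio > 0.9", "severity": "urgent"},
]

for error in validate_alerts(proposed):
    print(error)
```

In a pipeline, a non-empty error list would fail the check and block the merge, so a reviewer only ever sees changes that already satisfy the mechanical rules.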

The main advantage of this approach is that we can make configuration changes at scale. Moreover, we have a better change process that reduces accidental configuration errors, and with access to the whole history of changes stored in GitHub, we have an easy way to roll back unsuccessful changes.

Finally, I also want to discuss some of the challenges we faced to implement and maintain this process, and how we overcame them.

Keywords: GitOps, CI/CD, Monitoring Systems

Speaker Bio: Denis is a Site Reliability Engineer at IBM, within the Business Analytics organization. With a diverse background in systems and network administration and software development, he transitioned in the past few years towards the DevOps field and most recently to SRE, an area he has become passionate about. In his work, Denis develops and maintains different automation tools, many of them used by other SRE teams in support of Business Analytics.


SRE-090: Making of the SRE Omelette – Path to business and client success outcome

Presenter: Kevin Yu

Description: Have you wondered how industry leaders and executives measure SRE success and ROI? What can you learn from speaking to them about how they influenced SRE outcomes? This session summarizes insights I gained from interviewing industry leaders and executives on what they consider the key ingredients for success, as well as their recipe and path to achieving the desired outcome.

Keywords: Business Results, Client Success


SRE-101: Velos: Combining Passive and Active Monitoring For SLO Failure Diagnosis

Presenter Name: Saurabh Jha, Robert Filepp, Jesus Rios

Theme: Reliability in Hybrid Cloud and AI

Description: Recent innovations in computer science, engineering, and life and social sciences are driving the need for reliable and performant large-scale systems and software production deployments. However, increasing scale and complexity of the deployments coupled with increasing user demands, software and hardware defects lead to costly outages and significant performance degradations, as evidenced by the recent Uptime Institute’s 2022 Outage Analysis Report and newsworthy headline-making outages. There is sufficient evidence that data-driven AIOps models can help automatically identify service-level objective (SLO) violations, their diagnosis, and mitigation. However, developing such models is challenging because:
(i) Different datasets have different diagnostic power depending on the problem; hence, the AIOps model must select the most relevant features.
(ii) Passive observational data often is insufficient for detection, diagnosis, and mitigation of SLO violations; hence, the AIOps model must be flexible and incorporate techniques for active dataset collection (e.g., via probes) while ensuring minimal interference to system performance.
(iii) Trained models may not generalize, leading to poor accuracy with evolving code or infrastructure changes; hence, the AIOps model must be flexible to incorporate code or system-level changes.

In this demo, we will outline Velos, which is capable of detecting and diagnosing the cause of performance-related SLO violations at runtime. Velos is crafted to address the above-outlined limitations of current state-of-the-art AIOps models. Velos uses advanced causality-driven machine learning algorithms and active probing techniques to identify and diagnose the probable cause of an SLO violation. The causal nature allows an SRE to reason about the generated results and update the model through a UI if necessary. We will demonstrate Velos on a Kubernetes cluster monitored via Instana.

Speaker Bio:

  • Saurabh is a Research Staff Member at the IBM Research Division. He holds a PhD in Computer Science from the University of Illinois at Urbana-Champaign. His work is at the intersection of Machine Learning (with a particular interest in causal and generative models) and Systems (focusing on dependability). He has numerous technical papers in AIOps and has won several best paper awards for his work. He has worked with the SRE team of the Blue Waters supercomputing facility at the National Center for Supercomputing Applications (NCSA), where he successfully deployed an active probing solution for diagnosing failures.
  • Robert Filepp is a Senior Software Engineer working at IBM’s T.J. Watson Research Center since 1999. Prior to joining IBM Research, Robert held many roles in software development including Senior Manager of Systems Architecture at Prodigy Services Company, and Director of Data Architecture at Simon & Schuster. Since 2005 Robert has primarily been focusing on workload and server provisioning, scheduling, deployment, and compliance in environments ranging from Grid, to VMs, to Cloud.
  • Jesus is a Research Scientist at the IBM Research Division. He holds a PhD in Mathematics and Computer Sciences. He is an inventor of 10 patents and has authored over 50 scientific publications in computer science and mathematics. He joined IBM in 2010 and specializes in Artificial Intelligence (AI). He currently works on applying AI to problems in the IT domain.

SRE-109: Can I Trust Anomalies Detected from Unlabeled Data?

Presenter Name: Xi Yang, Ian Manning, Laura Shwartz

Theme: Reliability in Hybrid Cloud and AI

Description: One of the most critical tasks in AIOps is anomaly detection, which is essential for SREs to find clues for dissecting failures and improving the systems’ reliability. Developing robust anomaly detection models is challenging due to the low signal-to-noise ratio, the lack of accurate and crowdsourced labeling, and the variety of abnormal patterns across measurements and over time. In this demo, based on metrics data, we aim to outline an unsupervised, data-driven ensemble machine learning framework to detect anomalies with high significance. The framework can be learned without any labeling effort and refined periodically when applied to the streaming data collected in production systems.
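
To give a feel for what an unsupervised ensemble means in practice, the sketch below combines two simple label-free detectors and reports only the points that both of them flag. It is a toy stand-in rather than the framework presented in the demo: the two detectors (a median-absolute-deviation score and Tukey’s interquartile-range fences), the unanimous vote and the sample metric are all assumptions chosen for illustration.

```python
from statistics import median, quantiles


def mad_flags(values, threshold=3.5):
    """Flag points by modified z-score based on the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [v != med for v in values]
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag points outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def ensemble_anomalies(values, detectors=(mad_flags, iqr_flags)):
    """Return indexes flagged by every detector (unanimous vote)."""
    votes = [detector(values) for detector in detectors]
    return [i for i in range(len(values)) if all(v[i] for v in votes)]


# Hypothetical requests-per-second metric with one sudden drop.
requests_per_second = [120, 118, 121, 119, 122, 120, 35, 121, 119, 120]
print(ensemble_anomalies(requests_per_second))
```

Requiring agreement between detectors trades recall for precision, which is one simple way to report only anomalies with high significance when no labels are available to tune a single detector.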

Keywords: Anomaly detection, Unsupervised learning

Speaker Bio:

  • Xi Yang: A researcher focusing on temporal data analysis using machine learning and data mining techniques. She has published multiple conference papers in IJCAI, ICME, IEEE BigData, etc.
  • Ian Manning: A Senior Technical Staff Member at IBM. Over the past decade he has been leading the development of multiple IBM products that use analytics to address real-world challenges. He has 12 patents in data monitoring and analytics.
  • Laura Shwartz: A researcher focusing on infusing artificial cognition into tools and processes of IT technical services; co-authored 3 book chapters, 75 papers, and 70 patents.

 


SRE-110: The Joy and Challenges Of Owning a Service with One of the Deepest Stacks on IBM Cloud

Presenter Name: Ephraim Petry, Jaspreet Singh

Theme: Reliability in Hybrid Cloud and AI

Description: Our team hosts and supports the secure Hyper Protect Crypto Services on IBM Cloud. With that comes the distinction of supporting a service with one of the deepest stacks, ranging from physical cards to the latest polyglot microservices running on IBM Z machines. It is also one of the most complex services, involving multi-region global high availability across worldwide data centers, with a request flow that starts from the IBM Cloud console and spans Virtual Private Cloud and state-of-the-art IBM Z servers. Come join us as we share the experience, joy and challenges of a globally distributed Site Reliability Engineering team that ensures the stability and reliability of a service composed of a variety of IBM Cloud network and software services that form the building blocks of our offering, in the process interfacing and interacting with multiple IBM Cloud teams.

Keywords: #reliability #sre #security #ibmcloud #IBM-z

Speaker Bio:

  • Ephraim Petry: ZaaS SRE Architect
  • Jaspreet Singh: ZaaS Senior SRE

SRE-111: How to make Hybrid Cloud solutions more reliable

Presenter Name: Ope Jegede

Theme: Reliability in Hybrid Cloud and AI

Description: Enterprises are looking for new ways to respond to change and increase speed to market, whether they are looking to disrupt or are themselves being disrupted by an innovative market. Making hybrid cloud solutions more reliable requires putting the entire business life-cycle in a secure cloud, using machine learning observability, security, traffic shaping, infrastructure provisioning, usage, patterns, and recurrence to map every technology, process, people, location and security throughout the entire lifecycle. There exist various architectural diversity problems: communication across heterogeneous infrastructure, stochastic user behavior, computing hierarchy, scale distribution, anti-patterns, never-ending vulnerabilities, etc. Together these problems lead to many operational challenges, such as shorter mean time between failures, longer Mean Time to Resolution (MTTR), application/service performance degradation, memory leaks, etc. How do we make cloud and hybrid cloud solutions more reliable using AI? How will it help eliminate inconsistencies in code development?

Keywords: #HybridCloud, #EdgeComputing, #Cloud, #ArtificialIntelligence, #Reliability, #ReliabilityEngineering, #AIInfusedSolutions

Speaker Bio: Ope Jegede is the IBM Cloud ServiceNow Solutions Architect, with eighteen years of experience in the Government FedRAMP High DoD Level 2, Financial, Healthcare, and Oil & Gas industries. He has deep experience in “leading the Next Generation of Smart & Lean IT, Business Process Automation, Intelligent IT Governance, and Interconnected Infrastructures delivering value-capture baselines”. His expertise includes architecting cloud solutions, silos elimination, automation expansion, CI/CD pipeline orchestration, and architecting complex SCADA systems. He has the courage and persistence needed to innovate by challenging the status quo, to discover new ways in which technology can help enterprises save money, eliminate waste/silos, improve compliance, and gain more visibility across owned assets.


SRE-114: Being responsible by giving back: Open Source tooling and sustainability

Presenter Name: Gerald Mitchell

Theme: Responsible Computing and Sustainability

Description: Understand how different sources of code affect us: the services we rely on, the tools we use, and likely the libraries we build our services upon. The crucial missing piece to reliability and sustainability when using Open Source is you!

This will cover what is closed-source, inner-source, open-source, forking, a review of history, and an overview of license models.

Understand what is needed for responsible computing with sustainability, how to get involved, why open source is crucial, and why the SRE practitioner needs allocation from the business to participate.

Speaker Bio: Gerald Mitchell is a developer and product architect for IBM Developer for z/OS in the IBM DevOps on z hybrid cloud portfolio; he has worked in many roles for IBM including Host Access middleware full stack development, services, build, and product architect, product support Global Response Team, Rational brand Serviceability Architect, Jazz Foundation core engineering, and various technical services roles.


SRE-115: Astounded – how an IPMI firmware update automation caused an outage across Production regions

Presenter: Pradeep Kumar Errammagari

Description: What is IPMI?
An Intelligent Platform Management Interface (IPMI) is a hardware solution (chip) for controlling and managing Bare Metal servers.

The IPMI resides on Bare Metal (BM) servers. BMs such as the HAProxies and Vyattas in a region are critical servers that need to be up and running at all times (Primary/Secondary) to keep network communication up in a region. If both the Primary and Secondary BMs are down, the entire region will be shut down!

This talk shares the learnings from a Production outage that was caused by using a partially developed automation. It covers the following points:

  • How an outage occurred
  • Why a partially developed automation was used to update IPMI firmware across all Production regions
  • Which communication gaps played a role in this incident
  • The learnings taken from this incident
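Because a region goes down only when both members of a Primary/Secondary pair are unavailable, one guard an update automation could apply is to never touch a server unless its peer is healthy, and to halt on the first failure. The sketch below illustrates that idea only; it is not the automation discussed in the talk, and the function name, the `healthy` map, and the `apply_update` callback are hypothetical.

```python
def rollout(ha_pairs, healthy, apply_update):
    """Update firmware wave by wave without ever taking down both members of a pair.

    ha_pairs:     list of (primary, secondary) server names (hypothetical inventory)
    healthy:      dict mapping server name -> bool, mutated as the rollout proceeds
    apply_update: callback that flashes one server and returns True on success
    """
    peer_of = {}
    for primary, secondary in ha_pairs:
        peer_of[primary] = secondary
        peer_of[secondary] = primary

    # Secondaries first, then primaries: a pair is never updated in the same wave.
    waves = [[s for _, s in ha_pairs], [p for p, _ in ha_pairs]]

    updated = []
    for wave in waves:
        for server in wave:
            # Refuse to proceed if the peer could not carry traffic alone.
            if not healthy.get(peer_of[server], False):
                return updated, f"halted: peer of {server} is not healthy"
            healthy[server] = False  # the server is unavailable while flashing
            ok = apply_update(server)
            healthy[server] = ok
            if not ok:
                return updated, f"halted: update failed on {server}"
            updated.append(server)
    return updated, "completed"


if __name__ == "__main__":
    pairs = [("haproxy-a1", "haproxy-a2"), ("vyatta-b1", "vyatta-b2")]
    state = {name: True for pair in pairs for name in pair}
    print(rollout(pairs, state, lambda server: True))
```

Halting on the first failure keeps the blast radius to a single server per pair, at the cost of a slower rollout across regions.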

Keywords: Learning from Incidents

Speaker Bio:

  • Pradeep Kumar Errammagari: He is currently working as squad lead for the IKS SRE India squad. He has been working with IKS SRE for 7 years and with IBM for 23 years.

SRE-123: What do SREs do when they are not firefighting? Using On-Call SRE’s Down Time to proactively focus on reliability of your environments

Presenter: Rick Brunkhorst, Lakshmi Chittineni, Sandhya Ghimire

Description: As the Hyper Protect Services SRE Lead Architect, how do I funnel engineering innovation into SRE, provide reliability in Hybrid Cloud, and reduce toil for SREs? One way is to have SREs spend their “on-call” time, when they are not firefighting, away from planned work. Instead of working on sprint deliverables, the SREs follow our on-call “blueprint” (which will be shared) and are expected to proactively monitor the environment through dashboards and on-the-glass monitoring, looking to catch problems before they happen.

SREs should then use these experiences to funnel inspiration into the engineering aspects of SRE. They should be able to identify areas where monitoring and alerting need improvement, and even opportunities for automation, all while being relieved of the stress of context switching away from a planned deliverable.

Keywords: On-Call, Pager Duty

Speaker Bio:

    • Rick Brunkhorst, Hyper Protect Services SRE Architect: Rick is a 17-year IBMer from North Carolina, US. He graduated from North Carolina State University with a B.S. degree in Computer Science. During his tenure at IBM, Rick has worked exclusively on IBM Z, with early days as a co-op and then a tester on IBM Developer for z/OS, then as a team lead developer on IBM Z Development and Test Environment. Afterwards he moved to the IBM LinuxOne platform as a developer working on IBM Blockchain on IBM Z, and then moved into SRE in 2020 as team lead for IBM Hyper Protect Services Infrastructure SRE. In 2021, he became the architect for IBM Hyper Protect Services Infrastructure SRE, and then in 2022 became lead architect for IBM Hyper Protect Services SRE.
    • Lakshmi Chittineni, Hyper Protect Services IaaS (Infrastructure) SRE: Lakshmi is a Hyper Protect Services IaaS SRE and IBMer based in Toronto, CA. She graduated with a Bachelor's degree in Electronics and Computer Engineering from SASTRA University in India. She has almost 5 years of experience as a DevOps engineer and SRE. She started her career at Tata Consultancy Services (TCS) as a DevOps engineer, where she gained expertise in building CI/CD pipelines and in the deployment and maintenance of Spring Boot applications on the Azure cloud. Lakshmi joined IBM in September 2021 and works as a Hyper Protect Services IaaS SRE, where she uses her software engineering expertise to continue to improve reliability and availability for IBM Z in IBM Cloud.
    • Sandhya Ghimire, Hyper Protect Services IaaS (Infrastructure) SRE: Sandhya is a new IBMer based in Massachusetts, US. She is currently pursuing a Master's degree in Science IT from the University of Massachusetts (expected graduation: May 2023). She started her career at Tufts Health Plan as a System Engineer, where she gained experience in application development, support, and release builds. She then moved to Crunchtime, where she gained experience in Linux systems and in production deployment and support based on GCP. She later joined IBM in 2021 as a Hyper Protect Services IaaS SRE working on the IBM LinuxOne platform.

SRE-130: How not to write a project profile for SRE certification

Presenter Name: Uzma Siddiqui

Theme: Reliability in Hybrid Cloud and AI

Description: As an SRE with two project profiles approved for SRE Level 1 certification, I will share what I have learnt not to do while writing a project profile, based on reviewer feedback. This talk will focus on 5 pitfalls that can distract the reviewer from evaluating the SRE skills you want to showcase through your project profile.

Speaker Bio: Uzma Siddiqui works as an SRE for the API Connect cloud offering on IBM Cloud and AWS.

 


SRE-137: Build a reliable, compliant IBM Cloud-based event platform to enable Observability and real-time analytics for complex architectures like IBM Cloud VPC service

Presenter Name: Subhajit Patra, Lekha Rao

Description: Observability signals have traditionally been logs, metrics, tracing records, and basic events. But logs are expensive to read and process, metrics-based tooling can only deal with low-cardinality dimensions at any reasonable scale, and basic events don’t provide sufficient information to identify the state of a system.

This talk will focus on the direction in which observability is moving for complex, cloud native architectures: how to use high-cardinality, high-dimensionality data to build event-based observability platforms, and how to use complex, structured events to define the state of the system.

We will focus on how we built a reliable, scalable eventing platform for real-time analytics for VPC using IBM Cloud-based services like IKS, Toolchain, Data Engine, COS and Event Streams. We will cover the approach we took to define events and how they are useful for detecting system problems. We will also focus on how observability is moving towards an event-based architecture to enable real-time analytics.

Keywords: Observability, Events, Telemetry, IBM Cloud

Speaker Bio:

  • Subhajit Patra: Over 16 years of experience in the IT industry, with extensive experience in all aspects of software development, deployment, and implementation, and proficiency in design and development using Python, REST APIs, infrastructure, Ansible, Java, and Flask projects. He has delivered high-quality and well-received business process improvements across service functionality, availability and performance enhancements for a diverse user base. He has experience in building best-in-class hybrid cloud and analytics solutions and is currently building an observability platform for event-based, real-time streaming analytics using IBM Cloud services, leading a team of 6 people.
  • Lekha Rao has over 10 years of industry experience. She is currently managing and leading the IaaS Observability mission in India and is responsible for establishing observability platforms and processes for metrics, events, and alerts monitoring to enable SREs and developers across IaaS VPC to better identify and diagnose problems with confidence.

 


SRE-140: IBM Cloud Databases – Improving alert rates and system stability

Presenter Name: Cheryl Brown, Tim Waizenegger

Theme: Learning from Incidents

Description: 

The IBM Cloud Databases team operates tens of thousands of databases in production, which run on IKS clusters with approximately 17,000 worker nodes total across the estate.

Over time, as IBM Cloud Databases has grown significantly and despite our efforts to refine alerts, alert rates have increased and many alerts have become noise for on-call personnel. In 2022, the team has been focused on reducing alerts in order to improve the on-call experience and stability for our customers. So far, we have achieved an impressive 25% alert reduction, while the number of databases in production has grown by 8%.

In this session we will share our multi-faceted approach to reducing alerts including alert tuning and inhibitions, automation to handle alerts, code fixes, and more. We will also share lessons learned, and plans for future improvements on the stability of IBM Cloud Databases.
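The two figures quoted above can be combined into a per-database view. The short sketch below does the arithmetic under the assumption that the 25% refers to total alert volume and the 8% to total database count over the same period; the function name is hypothetical and the result is an illustration, not a number reported by the team.

```python
def alerts_per_database_change(alert_change, database_change):
    """Relative change in alerts per database, given relative changes in totals."""
    return (1 + alert_change) / (1 + database_change) - 1


if __name__ == "__main__":
    # 25% fewer alerts overall while the fleet grew by 8%.
    change = alerts_per_database_change(-0.25, 0.08)
    print(f"{change:.1%}")  # -30.6%
```

Under that assumption, the reduction in alerts per database (about 30.6%) is larger than the headline 25%, because the remaining alerts are spread over a larger fleet.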

Keywords: Alert-reduction, auto-healing, stability, automation

Speaker Bio: 

  • Cheryl Brown is a senior software engineer with over 15 years of experience in the field. She is currently based in RTP and is a member of the IBM Cloud Databases team. She believes getting to the root of a product or team’s biggest problems and working diligently to make improvements is one of the best ways to drive results. Cheryl has delivered SRE efficiency gains into IBM Cloud Databases, and data loading and API performance improvements into IBM Marketplace, IBM Control Desk, and IBM internal security tooling. Cheryl holds a Bachelor of Science degree in Computer Science from Case Western Reserve University.
  • Tim Waizenegger has been part of the IBM Cloud Databases team from the beginning of the project in 2018. He led the development of the Elasticsearch and Datastax database products and today is responsible for the common Kubernetes-based platform that all database products use, as the platform squad lead. His main focus is driving the development of common components that benefit all database products. This includes stability improvements, automation for manual operations tasks, as well as improvements to our development and CI process. Before joining IBM Cloud Databases, Tim helped the IBM Tools-as-a-Service team to kick-start their IBM-wide Kubernetes-based Jenkins service. Prior to joining IBM full-time, Tim received a PhD at the University of Stuttgart for his research into cryptographic deletion in cloud object stores. Tim is based in the IBM Research and Development Lab in Boeblingen, Germany.

SRE-141: Embracing SRE as Development Manager

Presenter: Nick Stielau

Description: SRE practice and mindset is a fundamental shift for organizations delivering services. This requires much more than “simply” building out world-class SRE capabilities. It requires changes to development teams and workflows to support, enable, and effectively collaborate between development and SRE, be it on an individual or team basis. We’ll cover some interaction points between development and SRE teams, and actionable approaches for effectively delivering service-based offerings.

Keywords: Management


SRE-144: Associate retention during The Great Resignation

Presenter: Dan Keohane

Description: Associate retention begins even before your associate’s first day on the job. In the age of The Great Resignation, it is critical for teams to be intentional in their strategies to retain hardworking and talented associates. Join Dan Keohane, a member of Red Hat’s OpenShift SRE team, as he tells the story of how his team has endeavored to attract and retain some of the greatest engineers in the field. From interviewing and onboarding practices that help new hires feel welcome and enabled to hit the ground running, to strategies that help tenured associates feel understood, purposeful, and empowered within their roles, this talk will provide a blueprint for how to hang on to the people who make your team great.

Keywords: Management, Retention


SRE-145: Eliminating Toil – The Managed OpenShift SRE Approach

Presenter Name: Sam Nguyen

Theme: Reliability in Hybrid Cloud and AI

Description: Running a production service has one constant no matter how big we scale: toil. Toil is manual, repetitive work that is reactive in nature but can be automated with good engineering. In this talk, we will discuss how our OpenShift SRE team prioritizes toil reduction while developing the most in-demand features for our platform. Having a strategy to review operational tasks effectively and at a sustainable pace allows us to spend more time engineering solutions that improve the reliability of our systems. We will deep dive into the kind of data that drives our toil review meeting, the right tools to use, and the audience needed to come out with actionable solutions.

Keywords: Toil

Speaker Bio: Sam is an SRE on the OpenShift Platform team at Red Hat, currently serving as the Regional SRE Lead for Eastern North America. As one of the Region Leads on the team, he has combined his engineering and operations knowledge to help his team maintain the continuity of a global team, coach SREs on operational topics, and drive engineering improvement including toil reduction. Outside of work, he likes to play sports and spend time with his family!


SRE-146: How incidents drive down incidents – Success story of incident management of an airline company after hybrid cloud migrations

Presenter: Timothy Leung, Tim Siu, Charles Chung

Description: Our project team takes care of multiple mission-critical systems of a major Hong Kong airline company. The company made a bold move last year to migrate most of its applications to hybrid cloud (including AWS/Azure) with a high percentage of native cloud services involved, which brought unforeseeable incidents to the systems. Unlike the on-prem environment, the cloud environment brought more complexity and introduced new challenges and surprises for the support team. Within 6 months, we developed experience in hybrid cloud system management and brought the number of incidents down to a very low level (fewer than 2 incidents per month per application). We would like to share this experience with our audience: the way we learn from incidents, how we organize incident handling together with the client, and ultimately how to drive down incidents. We believe these experiences can bring insight to organizations that are planning to move into the hybrid cloud era.

Keywords: #SREInAirline #HybridCloud #CloudMigration #CloudIncidentHandling #Reliability

Speaker Bio:

  • Timothy Leung, lead account architect. Focuses on Hybrid Cloud system development, SRE and legacy application migration. Interested in SRE, AIOps and performance tuning matters.
  • Tim Siu, senior system analyst. Focuses on cloud-related system development and the SRE area. Responsible for enhancing customers’ system reliability and maintainability, and for realizing the value of hybrid cloud for customers in different aspects.
  • Charles Chung, system analyst. Focuses on cloud-related development and hybrid cloud architectural design, and is an SRE enthusiast. Has migrated more than 40 applications and is highly experienced in post-migration BAU support and optimization.

 


SRE-148: Strategies to reduce on-call fatigue

Presenter Name: Marion Clelland

Theme: Learning from Incidents

Description: Phone won’t stop ringing? Can’t remember when you last slept? Do you see bash script when you close your eyes? Here are some ways I have helped the teams I work with find their sanity and bring the SRE joy back.

Speaker Bio:

  • Marion Clelland is currently a Senior Engineering Manager responsible for development and SRE within IBM Cloud.  Marion launched SRE within a traditional software development culture working through process, technology and social challenges with a hugely successful business impact and mindset shift in a department of 200+ people.  Marion is the first female at IBM to have achieved Thought Leader certification in the IBM Site Reliability Engineering profession.  In 2018, Marion was a winner of the TechWomen100 external industry award for her impact on IT culture as a female role model.

SRE-151: AWS Well-Architected Framework

Presenter Name: Srinivasa Naga Kadiyala

Theme: Reliability in Hybrid Cloud and AI

Description: A quick overview of the AWS Well-Architected Framework and its Reliability pillar.

 


SRE-152: What are the right measures for an SRE Organization?

Presenter Name: Sunil Joshi

Theme: Reliability in Hybrid Cloud and AI

Description: Generally, in a large-scale program, there are several measures across the lifecycle (business planning, design/dev through post-deploy activities). Measurements also come with a cost (labor, performance implications, reporting technology etc.). To become a mature SRE organization, it is essential to clearly identify which measures make the most sense, and in several cases this is persona-based (what might be interesting to a CIO may not be for a delivery director, and vice versa). This session will highlight a persona-based KPI/OKR framework as a means to meter and report meaningful measures to the appropriate stakeholders, to maintain the health of a transformation program and post-implementation.

Keywords: SRE, KPI, OKR, value metrics

Speaker Bio: Sunil is a Vice President and Distinguished Engineer at IBM and currently the CTO for Hybrid Cloud Services, Americas. Sunil is a regular speaker in the industry on Cloud, DevOps, Site Reliability Engineering, digital transformation and related concepts. His specializations are in the areas of hybrid cloud solutions, platform as a service and DevOps/SRE strategy. Sunil has authored several IBM Redbooks, blogs and articles.

Sunil is passionate about international music, multi-cultural cuisine, active sports and mentoring school & college students on career path and technology.

 


SRE-153: Using the Java Quarkiverse Operator SDK to improve reliability in distributed Kubernetes applications

Presenter Name: Dean Walter Parker

Theme: Reliability in Hybrid Cloud and AI

Description: Kubernetes itself and most Operators are typically written in the Go programming language for high speed and low memory footprint. However, the vast majority of enterprise-scale applications being modernized or containerized in the cloud are written in Java. Modern DevOps principles mandate that developers should also be responsible for certain operational aspects of their applications. This applied technology demo will show how to leverage existing Java skillsets to improve reliability in production using highly intelligent Kubernetes Operators.

This pattern and technology stack was used successfully in production at a large insurance client. Kubernetes Operators are often viewed as too complicated and beyond the scope of a typical developer’s responsibilities. This demo will show how easy this can actually be and detail the benefits of intelligent, custom real-time monitoring and analysis, dynamic manipulation of Kubernetes resources and even custom alerting using Operators written in natively compiled Java.

Speaker Bio: Dean is a certified OpenShift architect and a full stack architect with 30 years of software development experience. Now with IBM for 3 years, he is part of the CTO Group for IBM Consulting’s SRE practice and is also a frequent instructor for both Complex Solutions Architecture and Site Reliability Engineering courses.

 


SRE-154: Experiences in the networking area during the move of an insurance platform to cloud native

Presenter Name: Vinit Jain

Theme: Learning from Incidents

Description: IBM Consulting moved a large insurance platform to OpenShift on IBM Cloud. This talk will focus on the challenges encountered in the networking area and how those were overcome by applying SRE principles.

Speaker Bio: As an Executive Architect with IBM Consulting, Vinit leads solutioning and implementation of Cloud Native Architectures for Fortune 500 clients. He is a Certified Cloud Architect, Member of IBM Academy of Technology and an IBM Master Inventor.


SRE-156: Benefits and practices with and without SRE through a real life outage

Presenter Name: Christine White

Theme: Learning from Incidents

Description: This session will highlight the key practices and benefits of SRE by showcasing a side-by-side post mortem of how an SRE org versus a non-SRE org dealt with the same outage.

 


SRE-157: Embedding SRE in an organization from a manager and leadership perspective

Presenter Name: Cynthia Unwin

Theme: Responsible Computing and Sustainability

Description: This session will examine many of the cultural aspects of SRE. We will highlight lessons learned from other clients, what to look for, and what to change.

 


SRE-159: Reliability Engineering for RISE with SAP

Presenter Name: Naveen Purushothaman, Dharma Teja Atluri

Theme: Reliability in Hybrid Cloud and AI

Description: IBM for RISE with SAP, the premium supplier option, includes IBM Cloud as IaaS, implementation and technical management services in one offer. In this session we will explain how we engineered and implemented an integrated Control Plane for RISE with SAP by bringing together capabilities from IBM Cloud, Power Systems, Red Hat, Research, IBM Consulting and IBM Security BU into one integrated solution. Salient features of the solution include GitOps-based, fully automated landscape orchestration and provisioning, along with instrumentation for operations (build-to-manage) and integrated next-gen service management for day-2 operations using AIOps/ChatOps. Instana, LogDNA, QRadar, Qualys, SAP FRUN, IBM Consulting Digital Operations, ServiceNow and PagerDuty are the key tools used for the service management toolchain.

Keywords: #sap-sre, #sap-platform-engineering, #rise-with-sap, #sre, #platform-engineering, #ibm-cloud, #ibm

Speaker Bio:

  • Naveen is the Chief Architect for Platform Engineering at IBM. His 22 years of experience spans across infrastructure management, cloud platform management, DevSecOps, AIOps domains and IT Specialist, Architect, SRE professions.
  • Dharma Atluri has more than 21 years of experience leading digital transformation for clients globally, advocating hybrid cloud and AI adoption enabled by business value models. He is currently leading the RISE with SAP program as the chief architect, along with being the lead solution architect for the NextGen AMS manage solution leveraging DevSecOps, Automation, AI and Process mining driven by data insights.

SRE-160: SRE expansion – start by a differentiated learning model

Presenter: Carlos Salviato

Description: The learning process can hold back or boost the adoption of anything, much like a car accelerator.

In this presentation you will learn about the SRE Academy, a program that aims to create a different experience for participants by creating a continuous movement of learning, forming habits through culture.

The intention is to present the same experience in an engaging way, giving the group a pocket-sized taste of the SRE Academy: debugging and resolving a complex problem in a fun way.

Keywords: #academia_SRE #SRE_academy #learning #sre

Speaker Bio:

  • Carlos Salviato – Business Transformation Consultant, has 18+ years supporting clients in Business Operations and Infrastructure (Datacenters), and in recent years has focused on the Banking & Finance industry, helping clients with new ways of working and technology adoption.

 


SRE-166: Meeting the Auditor Requirements

Presenter Name: Andrea C Martinez

Description: For many enterprises, especially those operating in regulated industries such as Financial Services, Insurance, Health Care, Energy & Utilities, and Travel & Transportation, traditional approaches to meeting the requirements of external auditors fall short in a digital world where multiple releases can be deployed in production weekly, or even daily, at the click of a button. A group of IBM thought leaders undertook an investigation into how regulation and compliance requirements can be met without forgoing the benefits of DevOps and SRE. The objective was to describe how organizations can reap all of the benefits of DevOps and SRE while satisfying supervisory mandates and regulatory requirements. Come join this session, hosted by Andrea C. Crawford, a leader in the IBM Academy of Technology, where the results of the study will be summarized.

 


SRE-167: Adopting SRE & impact to the organisation

Presenter Name: Sunil Joshi

Theme: TBC

Description: IBM Client Panel


SRE-168: Becoming a Site Reliability Engineer

Presenter: Colin Thorne

Description: Join our host Colin and a panel of IBM Site Reliability Engineers as they answer questions on how they navigated towards a career in SRE, for example: what skills are important, what a typical week is like, and what the career path looks like.

Keywords: Discussion


SRE-169: Green IT – A holistic approach through Sustainable Service Design

Presenter: Prof. (Dr.) Bhuvan Unhelkar

Description:

This presentation by Prof. (Dr.) Bhuvan Unhelkar starts with the motivating factors for sustainable designs, which invariably deal with business interests. Sustainable Service Design is then presented in terms of an iterative and incremental (agile) approach encompassing the principles of Green ICT. Environmental Intelligence (EI) as an application of AI is also described.

Keywords: Sustainable Service Design; Green ICT; Environmental Intelligence (EI); Artificial Intelligence (AI)

Speaker Bio:

  • Bhuvan Unhelkar is a Professor of information technology in the School of Information Systems and Management, Muma College of Business, on the Sarasota-Manatee campus. His research focuses on big data strategies/AI, Agile processes and their application in practice. He teaches IT and Project Management courses at undergraduate and graduate level. He holds certifications in Business Analysis (CBAP), Professional Scrum Master (PSM), Software Quality Assurance and Training & Assessment/Education. He has written or co-written 27 books, chapters and research articles for numerous publications including the Cutter Business Technology Journal, the Scandinavian Journal of Information Systems, the Journal of Information Technology & Tourism, the International Journal of Mobile & Adhoc Network and the Global Journal of Finance and Management. Unhelkar earned his PhD and Master’s degree from the University of Technology, Sydney; MDBA from Pune and his Bachelor in Electronic Engineering from the M.S. University of Baroda.


SRE-180: AI Infused Right Sizing

Presenter: Abhay Choudhary

Description: Leveraging AI for responsible computing

Keywords: TBC


SRE-181: Sustaining everything, everywhere, all at once!

Presenter: Fanjing Meng

Description:

Sustainability is one of the hot topics in addressing the challenges of climate change. When we put it in the context of Site Reliability Engineering, it means improving the power consumption efficiencies of the whole stack (from infrastructure to application) while ensuring the SLAs/SLOs of the services they support.

However, it is very hard for an SRE to achieve optimal resource management and smart workload scheduling for the services they support. There are many variables to consider: flexible resource pricing models, dynamic time-based electricity pricing models, multiple cloud resource pricing models and more.

Is it possible to create a sustainable architecture without compromising on reliability? One of the basic tenets of reliability has been the potential of over-provisioning. Can we still have redundant systems while minimizing consumption?

We think – and will show why – that the answer is yes.

In this talk we will start by discussing the concepts and technical challenges of a full-stack sustainability optimization platform. Then, we’ll share the measurement systems, standards and testing approaches of sustainable computing which we have developed and use.

This will not merely be a high-level concept: we’ll present the architecture and detailed design of the whole platform. Part of the session will include sharing our practices, which were developed using a data-driven approach for efficiency optimization. We will also show an end-to-end demonstration of the whole platform in our data center and showcase active resource management and dynamic workload scheduling technologies.
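One of the variables named above, dynamic time-based electricity pricing, lends itself to a small illustration of workload scheduling. The sketch below picks the cheapest contiguous window for a deferrable job, with an optional latest start hour standing in for an SLO deadline. It is an assumption-laden toy and not the platform presented in the talk; the function name, parameters, and price figures are hypothetical.

```python
def cheapest_window(prices, duration, latest_start=None):
    """Return (start_index, total_cost) of the cheapest contiguous window.

    prices:       price per time slot, for example per hour (hypothetical data)
    duration:     number of consecutive slots the job needs
    latest_start: optional last slot at which the job may start (deadline guard)
    """
    last = len(prices) - duration
    if duration <= 0 or last < 0:
        raise ValueError("duration must be between 1 and len(prices)")
    if latest_start is not None:
        last = min(last, latest_start)

    # Sliding window: reuse the previous sum instead of re-adding every slot.
    window_cost = sum(prices[:duration])
    best_start, best_cost = 0, window_cost
    for start in range(1, last + 1):
        window_cost += prices[start + duration - 1] - prices[start - 1]
        if window_cost < best_cost:
            best_start, best_cost = start, window_cost
    return best_start, best_cost


if __name__ == "__main__":
    hourly_prices = [30, 28, 12, 10, 11, 25, 40, 35]
    print(cheapest_window(hourly_prices, duration=3))
    print(cheapest_window(hourly_prices, duration=3, latest_start=1))
```

The `latest_start` guard shows the tension the description raises: a tighter deadline can force the job into a more expensive window, so cost optimization has to be bounded by the service's SLOs.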

Keywords: TBC