
Incident Jan 2: GitHub outage

Incident report for the GitHub outage on January 2-3, 2025


Summary

Lovable was partially or fully down for about 19 hours. The direct cause was GitHub disabling our GitHub app due to the rapid creation of new repositories. Lovable uses its GitHub app to clone and push to users’ repositories. GitHub disabled the app because our rate of repository creation violated their terms of service and “significantly burdened their servers”, despite having assured us, when we reached out in December, that our usage pattern was fine. The issue was solved for new projects by moving to more scalable file storage (AWS S3); for existing repositories it was resolved when GitHub lifted the ban on our app.

We sincerely apologize for the inconvenience caused by this outage and are working tirelessly to minimize dependencies and create fallbacks that ensure reliability. The incident highlights the need for us to notice and respond to outages more quickly. It has also highlighted our reliance on providers like GitHub and the importance of building good relationships and having clear channels of contact with them. We’ll keep working on removing risky dependencies and improving our operational excellence. To compensate our users, any credits used during the outage have of course been restored. In addition, all paying users had unlimited access to the product during the following weekend.

Timeline

All times are in Coordinated Universal Time (UTC)

December 9, 2024

Given our recent growth, our CTO reached out to GitHub to verify that there would not be any concerns with the rate at which we were creating repositories and commits on GitHub.

  • 06:38: We got a response from support saying there were no concerns apart from general performance issues in their UI, which we don’t use, so this was not a problem for us.

[Image: warning from GitHub]

January 2nd, 2025

  • 22:00: Our GitHub connection is suspended by GitHub, effectively causing a full outage of Lovable.

[Image: message from GitHub]

  • 22:01: Email notification received from GitHub.

  • 23:26: The first person at Lovable notices that something seems wrong. They post in internal channels, but since it’s 00:26 in Stockholm, where the majority of the team is based, no one sees the notification.

January 3rd, 2025

  • 05:50: An engineer notices that we’re down and sees the email from GitHub.

  • 05:57: An incident is created and everyone on the team is added. Based on the email, the initial assumption is that only projects that have not been transferred are affected, since they live as private repositories in the Lovable GitHub organization.

  • 06:38: We realize that our GitHub app is also blocked, meaning that every project is affected and we’re 100% down.

  • 06:45: All non-engineers focus on trying to get hold of GitHub. We go through all the normal channels and also decide that making as much noise as possible on social media is likely the fastest way to get our case expedited.

  • 06:45: Engineers review how to restore functionality. Two main tracks are initially explored: creating a new GitHub organization and app, or using personal access tokens instead of our GitHub app.

  • 08:10: A personal access token is rolled out and works for projects that still remain in Lovable’s own organization, but the token is soon rate limited. We debate round-robining across as many personal access tokens as we have, but the ban also prevents us from adding new users to our organization. Instead we decide to investigate github.com/awslabs/git-remote-s3 (a configuration sketch follows this day’s timeline). We’re confident that S3 would handle any level of traffic we could send it, including any thundering-herd issues that might arise from enabling access again.

  • 10:43: We roll out git on S3 for all new projects; project creation is back online and working again. Existing projects are still not working.

  • 11:07: We roll out a fix that lazily migrates existing repos to S3 as well. We still use the personal access token, so we migrate as many projects as the rate limit allows.

  • 12:07: We hop on a call with AWS to make sure the git-remote-s3 library is a viable setup and won’t cause other issues. No major risks are identified.

  • 12:14: GitHub support provides more clarity on why access was restricted.

[Image: reason for the ban from GitHub]

  • 12:23: We respond to GitHub and tell them that we’ve reduced the repo creation rate to zero by migrating to S3.

  • 12:50: We discover issues with the repos migrated to S3: our production git executable assumes the default branch is called “master”, while our local setup and our code assume “main” is the default branch name.

  • 13:05: We change the git executable default to “main”.

  • 15:10: We’re seeing more signs that migrations to S3 are not working as expected, so we pause all migrations of existing projects.

  • 16:20: GitHub reinstates our app and apologizes for the issues it caused us and our users.

  • 16:27: We revert to using the GitHub app instead of the personal access token. Lovable is now back up and working for 99.93% of projects.

  • 17:30: We realize that some of the projects we migrated are only partially migrated to S3. We start working on finding these projects and migrating them back to the private GitHub org instead.

  • 21:20: The migration is ready and a first batch of projects is migrated.
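
To make the S3 track concrete, here is a minimal sketch of how a repository can be pointed at an S3-backed remote using github.com/awslabs/git-remote-s3. It illustrates the approach rather than our production code: it assumes the helper is installed (for example via pip) and that AWS credentials are configured, and the bucket, directory and project names are hypothetical.

```python
# Minimal sketch, not our production code: push a project's git repository to
# an S3-backed remote via the awslabs/git-remote-s3 helper. Assumes the helper
# is installed (e.g. `pip install git-remote-s3`), AWS credentials are
# configured, and `repo_dir` already contains a clone with a "main" branch.
# Bucket and project names below are hypothetical.
import subprocess

def run(args: list[str], cwd: str) -> None:
    subprocess.run(args, cwd=cwd, check=True)

def add_s3_remote(repo_dir: str, bucket: str, project_id: str) -> None:
    # git-remote-s3 teaches git to speak the s3:// scheme, so the bucket can be
    # added as an ordinary remote.
    run(["git", "remote", "add", "s3", f"s3://{bucket}/{project_id}"], cwd=repo_dir)

def push_to_s3(repo_dir: str) -> None:
    # Push the default branch; the same call works when lazily migrating an
    # existing clone. For brand-new repositories, `git init -b main` avoids the
    # master/main mismatch fixed at 13:05 above.
    run(["git", "push", "s3", "main"], cwd=repo_dir)

if __name__ == "__main__":
    add_s3_remote("/tmp/example-project", "example-git-bucket", "project-123")
    push_to_s3("/tmp/example-project")
```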

January 4th, 2025

  • 13:24: It appears that the migration we ran didn’t include all the affected projects. We start working on a new migration. We now have about 22,000 new repos that we need to analyze.

  • 17:00: While rewriting the migration we find another issue: S3 allows multiple concurrent writes to the same project. This is a race condition that GitHub would have prevented, since it handles out-of-sync commits by virtue of being an actual git server. On GitHub, if a pushed head already exists as a commit it won’t overwrite a newer head, but there are no such controls in S3 (a sketch of this fast-forward check follows this day’s timeline). The impact mostly seems to be limited to new projects.

  • 17:08: We spin off another workstream to address these race conditions.

  • 20:54: We detect that 0.2% of projects now on S3 have corrupt state in both S3 and GitHub. We start a workstream with GitHub to restore these projects.

  • 22:20: The updated debug script identifies additional repositories affected by an incomplete S3 sync. We migrate these as well.
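
To illustrate the guarantee we lost by moving off a real git server: GitHub refuses a push whose new head would discard the current head, while a plain object store simply keeps whatever was written last. Below is a hedged sketch of such a fast-forward check; the bucket, ref layout and function names are hypothetical and this is not our actual implementation.

```python
# Illustrative sketch of the fast-forward check a git server performs before
# moving a branch head, and that a plain object store like S3 does not.
# Bucket name and key layout are hypothetical.
import subprocess
import boto3

s3 = boto3.client("s3")
BUCKET = "example-git-bucket"  # hypothetical

def is_fast_forward(repo_dir: str, old_head: str, new_head: str) -> bool:
    # True when old_head is an ancestor of new_head, i.e. the update only adds
    # commits on top of the existing history.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", old_head, new_head],
        cwd=repo_dir,
    )
    return result.returncode == 0

def update_head(repo_dir: str, project_id: str, new_head: str) -> None:
    key = f"{project_id}/refs/heads/main"  # hypothetical ref layout
    current = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode().strip()
    if not is_fast_forward(repo_dir, current, new_head):
        # A git server would reject this push; without the check, a concurrent
        # writer can silently overwrite a newer head in S3.
        raise RuntimeError(f"non-fast-forward update rejected: {current} -> {new_head}")
    s3.put_object(Bucket=BUCKET, Key=key, Body=new_head.encode())
```

Even this check is racy between its read and its write unless the store offers conditional updates or locking; a git server enforces the rule atomically on the server side.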

January 5th, 2025

  • 08:26: We detect that our database believes some projects were migrated to S3 successfully even though there is no data in S3; they are still on GitHub. We write a new migration to update the database to point these repositories back to GitHub instead of S3.
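
As a hedged sketch of the kind of reconciliation described above: for each project the database claims is on S3, check whether any objects actually exist under its prefix and, if not, point the project back at GitHub. The table, column, bucket and prefix names are hypothetical, as is the choice of database client.

```python
# Hypothetical reconciliation sketch: flip projects back to GitHub when the
# database says "migrated to S3" but no objects exist under the project's prefix.
import boto3
import psycopg  # hypothetical: whichever client matches the production database

s3 = boto3.client("s3")
BUCKET = "example-git-bucket"  # hypothetical

def s3_has_data(project_id: str) -> bool:
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{project_id}/", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

def reconcile(conn: psycopg.Connection) -> None:
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM projects WHERE storage_backend = 's3'")
        for (project_id,) in cur.fetchall():
            if not s3_has_data(project_id):
                # Nothing ever landed in S3; the repo still lives on GitHub.
                cur.execute(
                    "UPDATE projects SET storage_backend = 'github' WHERE id = %s",
                    (project_id,),
                )
    conn.commit()
```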

Mitigations and remediations

Done

  • Add pager for critical detectors. We are in the process of rolling out an incident management system but unfortunately hadn’t enabled paging. This is now done, so we’ll be notified much faster in the case of this type of complete outage.
  • Create and edit new projects on S3 instead of GitHub.
  • Roll back the migration that left some repositories in an inconsistent state between S3 and GitHub.

Ongoing or not yet started

  • Refund credits for users who tried to use the product during the outage.
  • Improve our observability of important events by exporting them into a database suited for analytics, like BigQuery or ClickHouse. We have logs, but they are neither ergonomic nor performant to query in situations like this.
  • Migrate all existing repositories in our private GitHub org to S3.
  • Review other critical infrastructure partners and make sure we have a direct line to them in case of emergency, ideally a shared Slack channel but at least an email address or phone number for someone.
    • Cloudflare
    • Firebase

Additional comments

The option to export and sync to your own GitHub repo still exists and will work as before. No projects were lost in the incident, but some edits made during the outage were never applied.

Authors


Viktor Eriksson
