Do you want a Time Travel Debugger
Sep. 30th, 2021 04:11 pm
I never got around to talking about what my current work does: http://undo.io There was some previous discussion of the topic on Facebook: https://www.facebook.com/jack.vickeridge/posts/10103938681712440
What is a Time Travel Debugger
It records everything that happens in a program's execution, so you can step backwards as well as forwards, or rewind execution and then replay it again more carefully. Or you can "replay" it backward, e.g. going to the end of time, seeing your program crashed with a null pointer and then setting a watchpoint on that pointer and reverse-continuing until you find out where the pointer was set to that value.
There are two main modes of use: using it like a debugger sitting in front of a program, or using a companion recorder (which is actually an executable with much of the same code but packaged differently) to record your program in your overnight test suite, or while running it to replicate a bug that happens in a very long-running process. Once you've reproduced the bug once, you've almost finished: you can just load up the recording and step forward and back in a debugger until you figure out what went wrong.
That sounds impossible!
Yes, it does sound impossible, but it works.
It records literally everything the program does that interacts with the outside world in any way, e.g. any system call (including any file access, network access, gettime, even getpid, etc, etc), any instructions which write to shared memory, etc. That can get large for some programs (but customers do use it successfully!)
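To make the idea concrete, here is a minimal sketch in Python of recording and replaying calls that touch the outside world. It is only an illustration: the `EventLog` class and the toy `program` function are made up for this example, and the real tool works on machine code rather than by wrapping function calls.

```python
import os
import time


class EventLog:
    """Records results of calls that touch the outside world, or replays them."""

    def __init__(self, events=None):
        # With no events we are recording; with events we are replaying.
        self.replaying = events is not None
        self.events = list(events) if events is not None else []
        self.cursor = 0

    def call(self, name, real_function):
        if self.replaying:
            # Replay: never touch the outside world, just return the saved result.
            saved_name, saved_value = self.events[self.cursor]
            if saved_name != name:
                raise RuntimeError(f"replay diverged: expected {saved_name}, got {name}")
            self.cursor += 1
            return saved_value
        # Record: do the real thing and remember what came back.
        value = real_function()
        self.events.append((name, value))
        return value


def program(log):
    """A toy 'program' (hypothetical) whose output depends on the outside world."""
    pid = log.call("getpid", os.getpid)
    now = log.call("gettime", time.time)
    return f"pid={pid} time={now}"


recording = EventLog()
original_output = program(recording)

# Replaying with the saved events reproduces the run exactly,
# even though the clock has moved on since the recording.
replay = EventLog(recording.events)
replayed_output = program(replay)
```

The point is that once every non-deterministic input is in the log, the rest of the program is deterministic and can be re-run as many times as you like.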
It saves a snapshot at several points during history (by forking the process there), so it can create the state of any point in history by forking another process from that snapshot, and playing it forward using the saved events instead of actually doing any of the things that interact with the outside world.
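A rough sketch of the snapshot-and-replay idea, again in Python. This is an assumption-laden toy: it uses `copy.deepcopy` of a dictionary where the real tool forks a process, and `SNAPSHOT_INTERVAL`, `step`, `record` and `state_at` are names invented for the example.

```python
import copy

SNAPSHOT_INTERVAL = 4  # hypothetical value, chosen only for this example


def step(state, event):
    """One deterministic step of a toy program: consume one recorded input."""
    state["total"] += event
    state["steps"] += 1


def record(events):
    """Run the program once, saving a snapshot every SNAPSHOT_INTERVAL steps."""
    state = {"total": 0, "steps": 0}
    snapshots = {0: copy.deepcopy(state)}
    for index, event in enumerate(events, start=1):
        step(state, event)
        if index % SNAPSHOT_INTERVAL == 0:
            snapshots[index] = copy.deepcopy(state)
    return snapshots


def state_at(target, events, snapshots):
    """Rebuild the state after `target` steps from the nearest earlier snapshot."""
    base = max(t for t in snapshots if t <= target)
    state = copy.deepcopy(snapshots[base])
    # Play forward using the saved events, not the outside world.
    for event in events[base:target]:
        step(state, event)
    return state


events = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
snapshots = record(events)
```

"Going back" to step 6 is then `state_at(6, events, snapshots)`, which restores the snapshot from step 4 and replays two events forward. Stepping backwards is really jumping back and running forwards again.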
It does all this by rewriting the compiled program in memory, and maintaining a mapping between the rewritten memory and the original assembly. So you the user see the original source code and original assembly, with whatever level of debug info you originally compiled the program with. But behind the scenes, almost any non-trivial instruction is rewritten to do something else, either to save the result of the instruction in the event log, or to replay the value from the event log.
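As a very loose illustration of "rewrite the code but keep a mapping back to the original", here is a toy in Python. Everything in it is hypothetical: the instruction strings, the `log_result` pseudo-instruction and the function names are invented for the example, and real binary rewriting is far more involved.

```python
# Hypothetical toy instruction set: only "syscall" touches the outside world.
NON_DETERMINISTIC = {"syscall"}


def rewrite(original):
    """Return (rewritten code, map from rewritten index to original index)."""
    rewritten = []
    to_original = []
    for index, instruction in enumerate(original):
        rewritten.append(instruction)
        to_original.append(index)
        if instruction.split()[0] in NON_DETERMINISTIC:
            # Extra instruction the user never sees: save the result to the log.
            rewritten.append("log_result")
            to_original.append(index)
    return rewritten, to_original


original = ["mov r0, 1", "syscall getpid", "add r1, r0", "syscall read", "ret"]
rewritten, to_original = rewrite(original)


def user_visible(rewritten_index):
    """What the debugger shows when execution stops at a rewritten instruction."""
    return original[to_original[rewritten_index]]
```

Here the rewritten code is longer than the original, but `user_visible` always reports an instruction from the original program, which is why the user never notices the rewriting.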
That means that you can attach it to any program, compiled any way, just like any debugger can. You don't need to compile it with some magic -- people keep expecting this, and it could have been written that way, but instead, you can just connect it to any program you could attach gdb to.
Caveats
Recording multiple threads is slow, and recording multiple processes doesn't exist yet. We're working on it, but right now it can help with some multithreading bugs and not with others.
Program execution is slower, between 2x and 10x. We are working to improve that. Replaying through execution can be faster than that (and you can usually go directly to the beginning, end, etc without any replaying).
This is all on linux only.
The interface and implementation are based on the gdb frontend/gdb server protocol. So by default it looks like debugging with gdb but with "reverse-next" as well as "next". And it works with any program which uses a gdb backend, e.g. VS Code, emacs, although some of those are tested extensively and some aren't.
But no linux debugger has a very good UI, so currently it is mainly used by people who have to debug using something like gdb anyway, but want to be able to solve harder bugs quicker. We are trying hard to make it easy for languages like python and java where the translation has to understand an interpreter as well as the code. This works in the sense that it can be recorded and replayed, but getting a good user experience is a lot harder.
Worth and Price
I always describe it as, the difference between "not having a debugger" and "having a debugger". If you have a debugger, maybe actually 90% of problems you can solve with print statements. But the 10% that you can't fix with print statements could take months to solve without a debugger, or hours with a debugger. It's hard to describe why you need a debugger to someone who hasn't tried using one. But almost no-one would go back to not having one.
A time travel debugger makes trivial the small proportion of issues that still feel impossible even with a debugger. You say, "yes, it fails intermittently but we don't know if we'll ever track it down unless someone wants to study the failure for nine months", but that might be only hours with the right tool.
Unfortunately, this tool takes a large amount of programmer effort to create, and is only viable if it's sold commercially. If you view it as "The 5% of bugs we have that take 9 months to track down, instead get solved in a few hours", and compare the cost to the salary for an extra programmer or two, it's very reasonable. But most people including me hate paying for tools, so it's hard to sell.
It has a great retention rate -- any companies which have subscribed to a contract, have almost always kept it, and programmers who have used it regularly (including me) are very very eager to keep having it available.
Currently there are several introductory offers. There's an educational license which is cheaper or free. There might be an offer of free licences to the right open source project if you're interested. There's a 30 day free trial, and a personal license, in the hopes people will become converts and persuade their employer to adopt it. There is a standing offer that if you have an intractable, hard-to-reproduce bug that you'd like to see just go away, we can arrange some sort of trial to have someone come and help capture and diagnose that bug, and see if that leads to a longer term arrangement.
Ask questions in the comments. Feel free to download the trial -- if you've used gdb, it's fairly straightforward to try out, and it's magical to see "step back, step forward".
Or if it sounds like you might be someone who would actually benefit from acquiring a license, I can put you in touch with helpful people -- we used to focus on big clients because there was a lot of shakedown, but now that it works more reliably out of the box, it's plausible for a wider spectrum of companies and people.
no subject
Date: 2021-09-30 11:07 pm (UTC)
no subject
Date: 2021-10-01 10:14 am (UTC)
Arm has a (loosely defined) text format called "Tarmac" which logs all the activity of a CPU – what instructions it executes, what registers are updated with what values, what data is read/written from memory, and a bunch of ancillary things like exception events. When trying to debug a misbehaving piece of Arm code, I used to find it was easier to capture a single Tarmac trace of a failing piece of code and then page back and forth over it in 'less' than to run it in a Proper Debugger™, for basically the same reasons you give why time-travelling debug is useful: you don't have to re-run from the start to find out how a thing got that way, you can follow the trail of breadcrumbs backwards in time just once.
But viewing a trace in 'less' is awkward in other ways, because you can't easily find out what is currently in a register without searching backwards using a faffy regex. More so if you want to know something about the contents of memory, because memory accesses come in many sizes, so the regex would be even worse.
So I wrote a souped-up Tarmac trace browser which still lets me page through the literal trace file, but augments the UI with a display of the current registers and optionally a hex-dump view of memory, based on a complicated-data-structure index constructed by parsing the trace file at startup time. Then it can also search the index efficiently to answer questions like 'when was this register / piece of memory last updated?', and jump backwards to the instant in question, in log time (not even having to 'run backwards' step by step to get there).
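For a flavour of how a "when was this register last updated?" query can run in log time, here is a much-simplified sketch in Python. It is not the real index structure: it assumes the trace has already been parsed into hypothetical `(time, register, value)` tuples in time order, with integer values, and just uses a sorted list per register plus binary search.

```python
import bisect
from collections import defaultdict


def build_index(trace):
    """Map each register to the time-ordered list of (time, value) writes."""
    writes = defaultdict(list)
    for time, register, value in trace:
        writes[register].append((time, value))
    return writes


def last_write(index, register, now):
    """Return (time, value) of the latest write to `register` at or before `now`."""
    history = index.get(register, [])
    # Binary search: position of the first write strictly after `now`.
    position = bisect.bisect_right(history, (now, float("inf")))
    return history[position - 1] if position else None


# Made-up trace for illustration only.
trace = [(1, "r0", 0x10), (2, "r1", 0x20), (5, "r0", 0), (9, "r1", 0x30)]
index = build_index(trace)
```

With this, `last_write(index, "r0", 8)` jumps straight to the write at time 5 that set `r0` to zero, without stepping backwards through the intervening trace lines.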
This was an internal side-project of mine at work for many years, only used by my colleagues, but eventually we found reasons to want to publish it, and earlier this year it appeared on Github as an Apache 2 licensed open source project. \o/
Perhaps it's a bit rude to plug my free tool in the comments just after you've finished shilling your commercial one? :-) But I think it's not quite close enough to be a real competitor: my Tarmac browser is Arm-specific, dependent on a prior recording step which it does not implement itself, very limited in the size of problem it can deal with (both by the verbosity of the original trace file and my auxiliary index data structure), and nobody has yet contributed source-language integration. (My original use case for it was debugging code snippets that resulted from miscompilation, so source-level debugging would have been actively misleading.)
no subject
Date: 2021-10-05 08:57 am (UTC)