jack: (Default)
I was pleasantly surprised how easy it was to contribute to rust. It seems like there's a combination of things.

I don't exactly know who the driving forces are in the project. But I think several people are employed by Mozilla to work on rust development, which means there is some full-time work, not only moments scrabbled together in spare time.

There seems to be a genuine commitment to providing an easy on-ramp. Everything in github seems fairly up-to-date, which makes it a lot easier to get an idea of what's what. Bugs/issues are sorted into many categories, including ones that are easy, suitable for newcomers, which is very welcoming.

There is a bot which takes pull requests and assigns them to a reviewer, so most don't just languish with no-one accepting or rejecting them. The reviewer is chosen randomly from a pool appropriate to the component, and reassigns it if someone else would be better.

Even just spending a couple of days pushing the equivalent of a "hello world" patch through (what is the term for "the effort to make a one-line change with no significant code change"?), it felt like I was part of a project, with ongoing activity about my contribution, not someone screaming well-meaning suggestions into a void.

This isn't rust-specific, but it was the first time I used github for much more than browsing, and it was interesting to see how all the bits, code history, pull requests, etc interacted in practice.

Rust itself had an interesting model. A reviewer posts an approval on the pull request. *Then* a bot runs tests on all approved requests in descending order of priority, and merges them if they pass.

That means, the default assumption is that if a commit to master fails a test for some platform, nothing needs to be rolled back -- further pull requests continue to be tested and merged (assuming they don't gain any conflicts). And "master" is always guaranteed to pass tests.

Currently patches are either tested individually, or ones with inconsequential risks (documentation changes and the like) are tested in a batch. It seems to work well. It relies on the idea that most patches are independent, that they can be merged in any order, which usually seems to be true.

If you took the idea further, you can imagine ways of making it less of a bottleneck. Rather than just testing all patches which happen to be submitted at the same time, you can easily imagine a tier system. Maybe priority. Or maybe, have minor tests (eg. just that everything compiles and some basic quick tests of functionality which is known to have changed) to gate things through a first stage, and find problems quickly, and then a second stage which catches obscure errors but is ok to test multiple patches at once, because it doesn't usually fail.

In fact, I can't imagine working *without* such a system. At work we have a nightly build, but it would have been easy to add a tag for "most recent working version", and that never quite occurred to me, even as I suggested other process improvements.
jack: (Default)
This isn't solely about rust, but it made me think about something I wasn't really aware of. There's several common uses for pointers. The four uses themselves are nothing particular, but I'm interested in thoughts about the speculation about #3 and #4.

1. Heap allocation.

If you allocate a value on the heap, not the stack, you need to refer to it by a pointer. And if you're using a language other than C, you want it automatically de-allocated after the stack-allocated pointer goes out of scope (either immediately, using a smart pointer in C++, or eventually, in a garbage collected language).

If that's *all* you want to do, you can hide the issue from the programmer completely if you want to, as with languages that expect heap-allocation by default, and you're just supposed to know which copies produce independent duplicates of the same data and which copies refer to the same common data.

In rust, this is a Box (Box<T>): the value lives on the heap and is freed when the Box goes out of scope.
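
A minimal sketch of heap allocation with a Box. The names `Point` and `boxed_sum` are made up for illustration.

```rust
// Hypothetical example type, not from any library.
struct Point {
    x: i32,
    y: i32,
}

fn boxed_sum() -> i32 {
    // The Point lives on the heap; `p` is the stack-allocated owner.
    let p: Box<Point> = Box::new(Point { x: 3, y: 4 });
    // Field access dereferences through the Box automatically.
    // When `p` goes out of scope at the end of the function,
    // the heap allocation is freed.
    p.x + p.y
}
```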

2. Pass-by-reference.

If you pass a value to a function and either (a) want the function to edit that value or (b) the value is too large to copy the whole thing efficiently, you want to pass the value by reference. That could be done either with a keyword which specifies that's what happens under the hood, or explicitly by taking a pointer and passing that pointer.

In rust, you pass a reference (& for read-only, &mut for editable; roughly equivalent to a C++ reference or pointer), but there are various compile time checks to make sure the memory accesses are safe.
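
A small sketch of both cases, (a) editing the caller's value and (b) avoiding a copy. The function names are hypothetical.

```rust
// (b) Shared, read-only reference: the data is not copied.
fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

// (a) Mutable reference: the caller's vector is edited in place.
fn append_total(values: &mut Vec<i32>) {
    let t: i32 = values.iter().sum();
    values.push(t);
}

fn reference_demo() -> Vec<i32> {
    let mut v = vec![1, 2, 3];
    let before = total(&v); // borrow read-only
    append_total(&mut v); // borrow mutably
    assert_eq!(before, 6);
    v
}
```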

3. I need access to this struct from various different parts of my program.

Eg. a logging class. Eg. a network interface class. Each class which may need to access functionality of those classes needs a way to refer to them. There's a few ways of doing this, which are good enough, although not completely satisfactory.

You can make all those be static. But then there's no easy way to replace them in testing, and there's problems with lifetimes around the beginning and end of your program. You have to be careful to initialise them in the right order, or just assume you don't use them around the time they may be invalid (but that may throw up lots of errors from lint or rust compiler).

You can pass them in as arguments to every function. But that's clunky, and involves a lot of repetition[1]. However, see the weird suggestion at the end.

Or you can just make sure each class has a pointer to the necessary classes (or maybe to a top-level class which itself has pointers or members with the relevant classes), initialise it at class construction. However, this has *some* of the problems of the above two possibilities, it's less easy to replace the functions for testing, and it's somewhat redundant. This one is what's a little weird in rust, I think you have to use objects which are declared const, but actually have run-time-checked non-const objects inside ("interior mutability"). Again, see the weird suggestion at the end.

4. I have some class which contains many sibling objects which need to know about each other.

This *might* be a data structure, if you're implementing a vector, or a doubly-linked list, or whatever for a standard library. Probably not, those are usually implemented with old-school unchecked pointers like in C, and you just make sure you do it right. But it would be *nice* if you could have efficiency *and* checking.

More commonly, there's something like "a parent class with several different subsystems which each need to call functions in different ones of them". Or a computer game where the parent object is a map of tiles, where each tile contains a different thing ("wall" or "enemy" etc), and the types want to do different things depending what's adjacent to them.

In this case, my philosophy has slowly become, as much as possible, have each class return...something, and have the parent class do any glue logic. Which makes the coupling much less tight, ie. it's easier to change one type without it having baked in knowledge of all parent types and sibling types. But it doesn't work if the connections are too complicated. And even if it works, it gives up some of the flexibility of having different functions for different types of child type, because a lot of functionality has to be in the parent (where "do a different thing" may be "switch statement" not "child pointer derived from base/interface type and function dynamically dispatched"). Again, notice, this is functionally very very similar, the question is about what's easy to read and write without making mistakes.

Again, see weird suggestion below.

A weird suggestion for new syntax

This is the bit that popped into my head, I don't know if it makes sense.

We have a system for encapsulating children from parents. The child exposes an interface, and the parent uses the interface and doesn't worry about the implementation. But here we have children who *do* need to know about parents. One option is to throw away the encapsulation entirely, and put things in a global scope.

But how about something inbetween?

Say there is a special way of declaring type A to be a parent (or more likely, an interface/base type which exposes only the functions needed, and an actual class which derives from/implements that), and B1, B2, B3 etc to be children types, types which are declared and instantiated from A.

Suppose our interface, A, exposes a logging member function or class, and two members of types B1 and B2 (because those are expected to be needed by most of the children).

And then, you can only declare or instantiate those children B1, B2, B3 etc in A or a member function of A (that is, where there is a this or self value of type compatible with A). And whenever you call a member function of one of those children, just like that child is passed along in a secret function parameter specified with a "this" or "self" value, there is a similar construct (syntax pending) to refer to members of A.

So, just as "b.foo(x,y)" is syntactic sugar for "foo(b,x,y)" where b becomes "this" or "self", make "a.b.foo(x,y)" syntactic sugar for "foo(a,b,x,y)" where b becomes "self" and a becomes "parent" or "a::" or whatever.
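
For comparison, a sketch of how this looks in rust today, where the parent has to be passed explicitly. All the types here (`Logger`, `App`, `Child`) are hypothetical stand-ins for A and B above. The suggested syntax would make the `parent` argument implicit in the same way `self` already is.

```rust
struct Logger {
    prefix: String,
}

// Plays the role of the parent type A.
struct App {
    log: Logger,
}

// Plays the role of a child type B.
struct Child {
    id: u32,
}

impl Child {
    // `self` is passed implicitly by method-call syntax;
    // `parent` still has to be spelled out at every call.
    fn describe(&self, parent: &App) -> String {
        format!("{}child {}", parent.log.prefix, self.id)
    }
}

fn parent_demo() -> (String, String) {
    let app = App { log: Logger { prefix: "[app] ".to_string() } };
    let child = Child { id: 7 };
    // Method form and plain-function form are the same call.
    let method_form = child.describe(&app);
    let function_form = Child::describe(&child, &app);
    (method_form, function_form)
}
```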

Basically, ideally you'd ALWAYS have encapsulation. But sometimes, you actually do have a function you just want to be able to call from... most of your program. Without hassle. You know what you mean. But you can't easily specify it. So it sometimes ends up global. But it shouldn't be *completely* global. It should be accessible in any function called from a top-level "app" class or something, or any function of a member of that, or a member from that, if they opt in.

[1] Repetition

Everyone knows why repetition is bad, right? At best, you put the unavoidably-repeated bits in a clear format so you can see at a glance they're what you expect and have no hidden subtleties. But even above the arguments against, even if people are happy to copy-and-paste code, writing out extra things in function signatures drives people to find any other solution, even crappy ones.
jack: (Default)
Const values

Last time I talked about lifetimes. Now let me talk about the other part of references, ownership and borrow checking.

If you're dealing with const values, this is similar to other languages. By default, one place "owns" a value. Either declared on the stack, or on the heap (in a Box). Other places can be passed a const reference to that value. As described with lifetimes, rust checks at compile time that all of those references are finished with before the original goes out of scope. When the original goes out of scope, it's deallocated (from stack or heap).

Alternatively, it can be reference counted. In rust, you can use Rc<Type> instead of Box<Type> and it's similar, but instead of having a native reference to the value, you take a copy of the Rc, and the value is only freed from the heap when the last Rc pointing to it disappears.
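
A minimal sketch of the reference counting, using only the standard library. The function name is made up.

```rust
use std::rc::Rc;

fn rc_counts() -> (usize, usize) {
    let a = Rc::new(String::from("shared"));
    // Cloning the Rc copies the pointer and bumps the count;
    // the String itself is not duplicated.
    let b = Rc::clone(&a);
    let while_shared = Rc::strong_count(&a);
    // Dropping one Rc lowers the count; the String stays alive via `a`.
    drop(b);
    (while_shared, Rc::strong_count(&a))
}
```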

One reason this is important is thread-safety. Rc isn't thread safe, and rust checks you don't transfer it to another thread for that reason. Arc changes reference count atomically so *is* thread safe, and can be sent to another thread. (It's a copy of the Arc that's sent, but one that refers to the same data.)
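
A short sketch of sending an Arc to another thread (swapping Arc for Rc here is what the compiler would reject). The function name is hypothetical.

```rust
use std::sync::Arc;
use std::thread;

fn sum_on_other_thread() -> (i32, usize) {
    let data = Arc::new(vec![1, 2, 3]);
    // This copy of the Arc is moved into the new thread;
    // both copies point at the same vector.
    let for_thread = Arc::clone(&data);
    let handle = thread::spawn(move || for_thread.iter().sum::<i32>());
    let sum = handle.join().unwrap();
    // The original Arc is still usable on this thread.
    (sum, data.len())
}
```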

Const references can't usually be sent between threads unless the original had a lifetime of the whole program (static), because there's no universal way to be sure the thread is done with it, so it's always illegal for the original owner to go out of scope (?) But threads with finite lifetimes are hopefully coming in future (?)
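
For what it's worth, newer versions of the standard library do include scoped threads (std::thread::scope), which guarantee the threads have finished before the borrowed values go out of scope. A minimal sketch, assuming a compiler recent enough to have it:

```rust
use std::thread;

fn scoped_sum() -> i32 {
    let data = vec![1, 2, 3];
    let mut total = 0;
    // Every thread spawned inside the scope is joined before
    // `scope` returns, so borrowing local variables is allowed.
    thread::scope(|s| {
        s.spawn(|| {
            total = data.iter().sum();
        });
    });
    total
}
```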

Non-const values

A big deal in rust is making const (immutable) the default, and declaring non-const things (mut). I think that's a good way of thinking. But here it may get confusing.

You can have multiple references to an immutable value. But in order to be thread safe, you can only have one *mutable* reference. Including the original -- it's an error to access the original during the scope of a mutable reference. That's why it's called a "borrow" -- if you make a mutable reference to a value, you can only access the original again once the reference goes out of scope.
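
A tiny sketch of a mutable borrow, with the rejected access left as a comment so the rest compiles:

```rust
fn borrow_demo() -> i32 {
    let mut x = 10;
    {
        let r = &mut x; // `x` is mutably borrowed from here
        *r += 1;
        // Reading or writing `x` directly at this point would be a
        // compile error, because `r` is still used on the next line.
        *r += 1;
    }
    // The borrow has ended, so the original is accessible again.
    x
}
```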

But a point that's less well agreed is how useful this is when you don't pass anything between threads.

One argument is that you might be able to have a pointer *to* a value that you then mutate, but if it's something like a vector, you can't have a pointer/reference to a value in it because that might have been invalidated. And even if you have an iterator which could in theory be safe (eg. the iterator contains an index, not just a pointer), you still need to check for the iterator being invalid when it's used, which reduces various optimisations.
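
A sketch of the vector case. The commented-out lines are the ones the compiler rejects; holding an index instead of a reference is one common way round it.

```rust
fn grow_while_holding() -> i32 {
    let mut v = vec![1, 2, 3];

    // let first = &v[0];
    // v.push(4);              // rejected: `v` is already borrowed by
    // println!("{}", first);  // `first`, and push may move the buffer

    // An index stays meaningful even if the vector reallocates.
    let first_index = 0;
    v.push(4);
    v[first_index] + v[3]
}
```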

Another argument, that I found more interesting, is that even if the value isn't invalidated in a memory-safety sense, if you change the value in two disparate parts of code (say, you loop through all X that are Y calling function Z, and function Z in turn calls function W which does something to some X, including the ones you're iterating through), it's easy for the logic you write to be incorrect, if you can't tell at a glance which values might be changed half way through your logic and which won't be.

I found that persuasive as a general principle. Though I'm not sure how practical it is to work with those constraints in practice, if they're generally helpful once you know how to work with them, or if they're an unnecessary impediment. Either way, I feel better for having thought about those issues.

Workaround, interior mutability

"Interior mutability" is a feature of rust types (Cell and RefCell), which is a bit like the "mutable" keyword in C++: it allows you to have a class instance which the compiler treats as constant, (eg. allowing optimisations like caching return values), but does something "under the hood" (eg. the class caches expensively calculated results, or logs requests made to it, or keeps a reference count).

There's a couple of differences. One is, as I understand it, you don't just write heedlessly to the mutable value, rather rust checks at run time that you only take one mutable reference to it at once. So if you screw up, it immediately panics, rather than working most of the time but with subtle bugs lurking.
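
A minimal sketch of that run-time check with RefCell. It uses try_borrow_mut so the conflict shows up as an Err rather than a panic; the function name is made up.

```rust
use std::cell::RefCell;

fn second_borrow_fails() -> bool {
    let cell = RefCell::new(0);
    let first = cell.borrow_mut();
    // A second `borrow_mut()` here would panic. `try_borrow_mut()`
    // reports the same conflict as an Err instead.
    let conflict = cell.try_borrow_mut().is_err();
    drop(first);
    // Once the first borrow is gone, borrowing works again.
    let ok_after = cell.try_borrow_mut().is_ok();
    conflict && ok_after
}
```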

But it's also the case that if you do want a shared class accessed by many parts of your program (a logging class say, is that a reasonable example?), rust encourages you to use interior mutability to replicate the default situation in C or C++, of having a class multiple different parts of your program have a pointer through which they can call (non-const) functions in it.
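
A sketch of the shared logging class, assuming a single-threaded program and using Rc<RefCell<...>>. The `Logger`, `Network` and `Parser` types are hypothetical.

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct Logger {
    lines: Vec<String>,
}

// Two unrelated parts of the program, each holding a handle
// to the same logger.
struct Network {
    log: Rc<RefCell<Logger>>,
}

struct Parser {
    log: Rc<RefCell<Logger>>,
}

impl Network {
    // Takes &self, yet still writes to the logger via the RefCell.
    fn connect(&self) {
        self.log.borrow_mut().lines.push("connect".to_string());
    }
}

impl Parser {
    fn parse(&self) {
        self.log.borrow_mut().lines.push("parse".to_string());
    }
}

fn shared_logger_demo() -> Vec<String> {
    let log = Rc::new(RefCell::new(Logger { lines: Vec::new() }));
    let net = Network { log: Rc::clone(&log) };
    let parser = Parser { log: Rc::clone(&log) };
    net.connect();
    parser.parse();
    let lines = log.borrow().lines.clone();
    lines
}
```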

I have more thoughts on these different ways of using pointers maybe coming up.
jack: (Default)
I haven't looked at lifetimes relating to structs yet.

Come to think of it, if my previous understanding was right, the lifetimes of return values can only ever be a combination of lifetimes of input parameters, so there's only so many possibilities, and the compiler knows which ones are possible (because if you dropped the input parameters, it would know which of the potential return values it would still be valid to read)... why can't it just deduce the output lifetimes? Is it more complicated than that in most (or some) cases?

ETA: One more thing I forgot. Lifetimes don't *do* anything. They're like traits, or types which could have been automatically deduced: the compiler checks that the lifetimes you specify don't leave any variables being used-after-free. But they don't *change* the lifetime of a variable, just tell any code that uses the variable what the lifetime *is*.
jack: (Default)
The formatting is probably going to be screwed up here, because I'm going to use a lot of <. This is a mix of stuff I'm trying to get straight in my mind, so I hope it's somewhat informative, but please point out where I've been unclear, confused or incorrect.

I am going to talk about lifetimes specifically, and save the "only one mutable reference at once" aspect of the borrow checker for the following post.

In C or C++, it's possible to take a pointer or reference to a variable, and use the pointer or reference after the value is no longer valid. If it happens within a single function, it's often possible for the compiler (or lint tool?) to warn you. Eg. returning a pointer or reference to a value in a temporary variable. If you have a pointer in a different part of the program, it's easy to miss. Ideally you write code so it doesn't happen, but it's good if it *definitely* can't happen.

Rust makes the equivalent of those compiler warnings a part of the language. Each value has an associated lifetime. That is typically the scope it was first declared in, but could be shorter (eg. if it's artificially dropped) or longer (if it's allocated on the heap). That is basically "how long it's ok to hold a pointer/reference to this value (or part of this value)"[1].

That's all much the same within one function, but rust applies the same guarantees across the whole program. In order to do so, if you have a reference of any sort, it needs to carry along a lifetime parameter. These are usually implicit to avoid boilerplate, which means you can dig yourself in surprisingly far before suddenly discovering you have NO IDEA how this works :)

A simple-ish example might be a function which takes a string, and returns a substring. In C++, you would have to choose between returning a new string that copies that substring (with a small overhead), or returning a slice of some sort (a char*, or a special slice type) that references the original memory -- but becomes invalid if that memory goes out of scope and is deallocated. In rust, you can specify that the returned value has the same lifetime as the parameter supplied, and then the normal checks for the calling function make sure that the slice/reference isn't used after the original value is deallocated (or changed).

In fact, if there's only *one* parameter and the function returns a reference, the return is assumed to be a reference to the input parameter (or to part of it) and you don't need to specify the lifetimes, it all just happens. Except slightly more safely than in C++ where you would not usually write a function like that because it's not easy to see if it's used safely.
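
A minimal sketch of the one-parameter case, with no lifetime written anywhere. `first_word` is a made-up example function.

```rust
// The returned slice is assumed to borrow from `s`,
// so no explicit lifetime annotation is needed.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}
```

The caller can then use the returned slice for as long as the original string is alive, and no longer.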

If there's two input parameters, you need to specify which the return value depends on. In principle you can specify the return value might depend on either, or on both, but I haven't tried anything like that.
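
A sketch of both situations, with hypothetical function names. In the first, the result depends on only one parameter; in the second it may come from either, so both share the same lifetime parameter.

```rust
// The result borrows only from `text`; `marker` can be dropped
// as soon as the call returns.
fn before_marker<'a>(text: &'a str, marker: &str) -> &'a str {
    match text.find(marker) {
        Some(i) => &text[..i],
        None => text,
    }
}

// The result may be either input, so it is only valid while
// both inputs are.
fn longer<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() >= y.len() { x } else { y }
}
```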

That's about as far as I've got. There's more stuff I've thought about, but I'm not certain enough to talk about it.


The actual format is a special case of a template function. Lifetimes are named like identifiers are, but with a ' at the start. Conventionally 'a, 'b, 'c etc.

The function name is followed by <'a> or <'a, 'b, T> with as many lifetime and/or type parameters as needed. Each input reference can then be annotated with a lifetime parameter after the &. You can use the same lifetime parameter for multiple input references and the function will just use the smallest lifetime (the lifetimes of the parameters supplied don't actually have to be the same).

Then the return value of the function specifies the appropriate lifetime parameter.

fn process_str<'a>(in_str: &'a String) -> &'a String

<b>Question 1</b>

That seems like a lot of confusing boilerplate. Since it seems like lifetimes ONLY come from template parameters, why do you need to specify them in the template parameters list? Why can't that just be omitted?

There's a stack overflow question, but the answer just says "better to be explicit", it doesn't really give any examples of what would be confusing without that.

<b>Question 1a</b>

For that matter, specifying the input parameters at all seems complicated. Since lifetimes can (?) only come from input parameters, why can't they be specified that way?

fn foo(a:String, b:String, c:String, d:String) -> & lifetime(a) String

Return a String with lifetime equal to the lifetime of a. And inside the lifetime construct, you could allow min, max etc to combine lifetimes of variables if necessary.

<b>Question 2</b>

If you dereference a reference to a value correctly before the value goes out of scope, it's still an error if the reference is still in scope, even if you don't use it. That sort of makes sense (there's no point having it), but it also doesn't do any harm. Why isn't the end of lifetime considered the last time a variable is *used*, not where it goes out of scope?

There is an rfc to reconsider this question, but I don't think it was acted on. Presumably there's not much benefit and there is a chance of confusion.

<b>Question 3</b>

If the compiler knows where the reference is needed, why can't it keep the value alive that long? Like a reference counted or garbage collected value, but at compile time?

I guess that's just way too complicated or confusing.

<b>Comparison to other languages</b>

This fixes a big problem in languages that habitually have bare pointers (see C, and half of C++)

If you don't care about efficiency, of course, you can just use reference counted references or garbage collected references everywhere. This can occasionally be confusing (if some reference keeps a value alive, but the value isn't really meaningful any more). But basically works. (See the other half of C++, and most other languages.)

<b>Footnote 1</b>

Something I was confused by for a time, is that in rust you can only *copy* most values explicitly with .clone() (simple types such as integers, which implement Copy, are the exception). That's like in C, where you can only memcpy a struct if you know it's safe to do, or in C++, where it's implicit, but you need to have a copy constructor. But unlike C, where writing a=b just doesn't work for most types, in rust, you can assign *any* type with a=b, but for non-Copy types it functions as a move: it copies the value from b into a with a straight memcpy, including any contained pointers or whatever. But b is then invalid.

It checks at compile time that you can't use b again, so in practice the first time you notice this is "wait, it looked like I assigned ok, but then I got other weird errors".
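
A small sketch of move versus clone. The line that would fail to compile is left as a comment.

```rust
fn move_demo() -> (String, String) {
    let b = String::from("hello");
    let a = b; // a move: `b` can no longer be used
    // println!("{}", b); // compile error: borrow of moved value `b`
    let c = a.clone(); // explicit copy: `a` stays valid
    (a, c)
}
```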

But there are other benefits, like being able to return a struct value from a function without special arrangements to avoid a temporary copy.

But it confused me about lifetimes, because the contents of an object can often live on after the object is dropped. When in fact, the compiler often arranges that when it *would* memcpy, it actually reuses the same part of memory, so references might still be valid. But that's an implementation detail, so when you "move" an object, that's the end of its lifetime.
jack: (Default)
My goal for January was to learn some rust, and if possible contribute to the rust compiler/library source on github.

Rust is a language aimed at lowish level code with the efficiency of C, but with the safety of a garbage collector and type checker. Someone (ciphergoth? fanf?) originally pointed it out to me, and obviously I'm really interested in that intersection, as my experience has mostly been in lowish level stuff, but also in avoiding all forms of boilerplate and overhead.

For a while, an informal motto was "speed, safety, convenience, pick three", which it presumably won't live up to, but shows how it's being aimed.

It's not ready to replace C or C++, it's still maturing, but has matured a fair bit. And is almost the only language anywhere where using it for things C is used for now is even conceivable.

I don't know if my interest will go anywhere, but I feel like I learned useful things from just trying. Understanding the trade-offs made in designing a language, and the types of code patterns it invites similarly to C++, and ones it recommends for and against, and thinking about what the code I write is doing in practice, seem to have made me understand programming a little bit better.

So far

I read some of the introductory books and articles. I installed the compiler and package manager (on an Ubuntu VM) and made sure I could write a "Hello world" program.

I got the source code for the compiler and libraries, tested I could build it, and looked at the open bugs. I was very pleased that there was an existing effort to tag some bugs as "easy" for new contributors. I didn't try to edit any of the actual compiler code yet, but I did submit a small change to the documentation.

I was also pleased that there was a bot (rust high-five) welcoming new contributors and assigning reviewers so small patches actually get accepted or rejected, not just languish. And a bot doing continuous integration (rust bors, with a non-rust-specific development known as homu), specifically testing patches *before* being pulled to master. So changes actually made it into nightly release almost immediately, and three months later into a regular release.

I was also pleased that the code of conduct read like it was written by someone in this century.


I've read something about some of the concepts in rust people find weird, and may try to write something about my understanding, to see how much I've grokked, and get feedback from other people who've played with rust.

I've mentioned in passing several small design choices that I enjoyed. Eg. the error handling, usually returning a Result type, which is either a success with a return value, or an error with an error type or string. Eg. putting type annotations on function arguments, but relying on automatic variable types within function bodies. I won't review all of these, but in general, they felt good when I saw them. If I actually compare them to what I'm used to in other languages, I'll see if they still feel good.
jack: (Default)
I am still mulling this over after reading some articles on it (thanks, fanf, Kaela).


Imagine you have a fairly simple function.

RetType func1(arg1, arg2)
   return func3(func2(arg1),func2(arg2)).func4();

But those other functions may encounter errors. Eg. they involve opening files, which may not be there.

Assume the error return can't usually be passed to a follow-up function.[1] The obvious, and then necessary, step is for each function call to test the return value: if it's an error, return an error from this function, else continue with the calculation. But this usually involves several lines of code for each of these functions, which obscures the desired control flow.
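
To make the cost concrete, here is a sketch of that check-every-call style in rust. The bodies of func2, func3 and func4 are invented stand-ins (parse a number, add, reject negatives) and String is used as a placeholder error type.

```rust
// Hypothetical stand-ins for func2/func3/func4.
fn func2(arg: &str) -> Result<i32, String> {
    arg.trim()
        .parse::<i32>()
        .map_err(|e| format!("bad number {:?}: {}", arg, e))
}

fn func3(a: i32, b: i32) -> Result<i32, String> {
    a.checked_add(b).ok_or_else(|| "overflow".to_string())
}

fn func4(n: i32) -> Result<i32, String> {
    if n >= 0 { Ok(n) } else { Err("negative total".to_string()) }
}

// Every call is followed by an explicit test of its result.
fn func1_verbose(arg1: &str, arg2: &str) -> Result<i32, String> {
    let a = match func2(arg1) {
        Ok(v) => v,
        Err(e) => return Err(e),
    };
    let b = match func2(arg2) {
        Ok(v) => v,
        Err(e) => return Err(e),
    };
    let c = match func3(a, b) {
        Ok(v) => v,
        Err(e) => return Err(e),
    };
    func4(c)
}
```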

If you are willing to accept exceptions, you can just write the code above and allow any exceptions to propagate. But that represents a lot of hidden complexity from not knowing what might be thrown. And often overhead in runtime.

And in fact, this may obscure a common pattern, that for some function (eg. "parse this"), you SOMETIMES want to treat the failure as an error, and sometimes to interrogate it. As in, choose in the calling code whether failure is an error-value or exception.

Also remember, in C-like languages, many values unavoidably have a possible error case which can't be passed to other functions, null pointer. Ideally it would be clear which pointers might be null and which have already been assumed not to be.

In Rust

In Rust (if I understand correctly), these possibilities are often wrapped up in a try macro.

There is a conventional "Result" return type from most functions which may succeed or fail, which has one of two values. Either 'Ok' (usually, though not necessarily, wrapping a return value). Or 'Err', wrapping a specific error (just a string, or an error object).

The try macro combines the "test return value, if it's an error, return that error from this function, else, evaluate to the successful value" into a brief expression:
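
A sketch of that pattern. To keep it independent of compiler version, it defines its own stand-in macro, `try_sketch!` (a made-up name), which mimics what I understand try! to expand to; `parse_sum` is likewise just an example.

```rust
// Hand-written stand-in for the standard try! macro, for illustration.
macro_rules! try_sketch {
    ($expr:expr) => {
        match $expr {
            // Success: the whole expression evaluates to the inner value.
            Ok(value) => value,
            // Failure: return the error from the enclosing function.
            Err(err) => return Err(From::from(err)),
        }
    };
}

fn parse_sum(a: &str, b: &str) -> Result<i32, std::num::ParseIntError> {
    let x = try_sketch!(a.parse::<i32>());
    let y = try_sketch!(b.parse::<i32>());
    Ok(x + y)
}
```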


Which seems like often what you want. Obviously if you want to handle the error in some way (say, you're interested in whether it succeeds, not just the successful result), you can interrogate the result value for ok or err.

And there's also a method, unwrap, for "assume success, unwrap the result value, panic if it's not there", just like you can access a pointer without checking for null if you want. But functions which can't return an error shouldn't return "Result", so if you do that, it's clear you *might* fail. Which is exactly what you want for throw-away code. But it does mean you can search for unwrap calls if you want to find all the points where you did that and fix them.

Rust recent innovation: ?

I mention try! for historical reasons, but just recently, Rust has promoted it into a language feature, reducing the overhead further from four to six characters, to 1: '?' after a value means the same thing as the try macro.

Result<int, ErrType> func1(arg1, arg2)
   return func3(func2(arg1)?,func2(arg2)?)?.func4()?; // Pseudocode, not actual rust syntax
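
In actual rust syntax, that might look something like this sketch. As before, the helper bodies and the String error type are invented for the example, and func4 is written as a plain function rather than a method.

```rust
// Hypothetical stand-ins for func2/func3/func4.
fn func2(arg: &str) -> Result<i32, String> {
    arg.trim()
        .parse::<i32>()
        .map_err(|e| format!("bad number {:?}: {}", arg, e))
}

fn func3(a: i32, b: i32) -> Result<i32, String> {
    a.checked_add(b).ok_or_else(|| "overflow".to_string())
}

fn func4(n: i32) -> Result<i32, String> {
    if n >= 0 { Ok(n) } else { Err("negative total".to_string()) }
}

// Each `?` returns early with the error, or yields the success value.
fn func1(arg1: &str, arg2: &str) -> Result<i32, String> {
    func4(func3(func2(arg1)?, func2(arg2)?)?)
}
```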

Rust recent innovation: chaining

This is also really new and not standard yet, but I like the idea. Error chaining. The method .chain_err(|| "New error") (from the error-chain crate) is applied to the result of a function call. If it was a success, that's fine. If not, this error is added to the previous error. It is typically then followed by the try macro or ?. (I think?)

That means that your function can return a more useful error, eg. "could not open log file" or "could not calculate proportion". Which carries along the additional information of WHY it couldn't, eg. "could not open file XXXX in read mode" or "div by zero".
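
The same idea can be sketched with only the standard library. This is not the chaining library's API, just a hand-rolled illustration: `Chained` and `read_port` are made-up names, and the lower-level error is kept as the "source" of the higher-level one.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical wrapper: a high-level message plus the underlying cause.
#[derive(Debug)]
struct Chained {
    msg: String,
    cause: Box<dyn Error>,
}

impl fmt::Display for Chained {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)
    }
}

impl Error for Chained {
    // Exposes the lower-level error to callers that care about it.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(self.cause.as_ref())
    }
}

fn read_port(text: &str) -> Result<u16, Chained> {
    // On failure, wrap the parse error in a more useful message.
    text.trim().parse::<u16>().map_err(|e| Chained {
        msg: "could not read port setting".to_string(),
        cause: Box::new(e),
    })
}
```

A caller can print the top-level message on its own, or walk down via source() to find out why it failed.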

And then a higher level function can decide which of those it cares about handling -- usually not the lowest level one.

In some ways like exceptions, but (hopefully, because Rust) with no runtime overhead.


[1] I often think of it as, an error-value is one that, under any future operation of any sort, stays the same error value, but that's usually not how it's actually implemented.