Update 2024-05-04T2202: ruby was accidentally assigned to python’s behavior set. Thanks to Karl for pointing it out!
Motivation
Many systems provide functionality to join file paths. Shells and filesystem APIs in particular expose it to the user. So if we join foo and bar, we want to get foo/bar on a UNIX system. One should not implement this behavior with a common string library, because joining foo/ with bar should still be foo/bar and not foo//bar. Simultaneously, we need to take operating-system specific behavior into account. Win32 does not use a slash, but a backslash to separate file path components. Furthermore, UNIX calls names starting with a slash “absolute filepaths”, whereas Windows uses drive-letter prefixes like C:\ as well as Universal Naming Convention (UNC) prefixes like \\server\share. A library joining filepaths should handle absolute filepaths according to the filesystem in question.
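To make the difference concrete, here is a minimal sketch in python comparing naive string concatenation with the standard library's join. It uses posixpath and ntpath (the modules behind os.path on UNIX and Windows), so the output does not depend on the machine it runs on. naive_join is a hypothetical helper introduced only for illustration.

```python
import ntpath
import posixpath


def naive_join(a: str, b: str) -> str:
    """Hypothetical helper: plain string concatenation with a slash."""
    return a + "/" + b


print(naive_join("foo/", "bar"))      # foo//bar
print(posixpath.join("foo/", "bar"))  # foo/bar
print(posixpath.join("foo", "bar"))   # foo/bar
print(ntpath.join("foo", "bar"))      # foo\bar
```

The library version avoids the doubled separator and picks the separator of the respective platform.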
Now I got interested in the question: what should be the result of joining foo with /bar?
The motivating behavior
golang implements the following behavior:
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	fmt.Println(filepath.Join("foobar", "/etc/passwd"))
	// gives "foobar/etc/passwd"
}
python implements the following behavior:
import os.path
print(os.path.join('foobar', '/etc/passwd'))
# gives "/etc/passwd"
And this behavior is confusing to many users:
- StackOverflow: Why does Path.Combine not properly concatenate filenames that start with Path.DirectorySeparatorChar?
- python issue 11378: “os.path.join when second argument starts with '/' (linux/unix)”
- python issue 35223: “Pathlib incorrectly merges strings.”
- python issue 31617: “os.path.join() return wrong value in windows”
- python issue 44452: “Allow paths to be joined without worrying about a leading slash”
- python ideas proposal: “pathlib.Path.joincomponent()”
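The kind of surprise behind these reports can be reproduced on any machine with posixpath and ntpath. The following is only a small sketch of the standard library's behavior; the concrete paths are made up for illustration.

```python
import ntpath
import posixpath

# POSIX flavour: an absolute component discards everything joined so far.
print(posixpath.join("foobar", "/etc/passwd"))  # /etc/passwd
print(posixpath.join("a", "/b", "c"))           # /b/c

# Windows flavour: a rooted component keeps the current drive,
# whereas a component with a different drive replaces everything.
print(ntpath.join("C:\\Windows", "\\Media"))    # C:\Media
print(ntpath.join("C:\\Windows", "D:\\Media"))  # D:\Media
```

Note that with more than two arguments, the reset can happen in the middle of the argument list, which makes it easy to overlook.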
Independent of preferences, one needs to consider actual vulnerabilities emerging. Indeed, there are two recent ones motivating this evaluation:
- CVE-2024-1708 refers to a vulnerability in ASP.NET code by ConnectWise ScreenConnect
- CVE-2024-20345 is also related to path traversal, but maybe not to this behavior.
Evaluation per programming language
rust implements python’s behavior:
fn main() {
    // no `mut` needed: the path is never modified
    let p = std::path::Path::new("foobar");
    println!("{}", p.join("/etc/passwd").display());
    // gives "/etc/passwd"
}
C++'s filesystem API (since C++17) implements python’s behavior and refers to POSIX:
std::filesystem::path("foobar") / "/etc/passwd";
// the result is "/etc/passwd" (replaces)
Java declares the behavior as “provider specific” and thus the situation remains unclear, because I don’t have a JVM at hand to try it out.
Path.Combine on .NET implements python’s behavior:
“If the one of the subsequent paths is an absolute path, then the combine operation resets starting with that absolute path, discarding all previous combined paths.”
Dart & Flutter provide python’s behavior. D implements python’s semantics as well, as documented by “If any of the path segments are absolute (as defined by isAbsolute), the preceding segments will be dropped”:
buildPath("/foo", "/bar")
// /bar
Tcl file join is also on python’s side: “If a name is an absolute path, all previous arguments are discarded and any subsequent arguments are then joined to it.”
And then finally, we have more supporters of golang’s behavior:
- ruby implements golang’s behavior:
  puts "Hello World"
  p File.join("foobar", "/etc/passwd") #=> "foobar/etc/passwd"
- Nim, where joinPath("usr/", "/lib") yields "usr/lib"
- FreePascal fails to mention the implemented behavior in ConcatPaths, but I tried it out and it sides with golang. Unlike other APIs, it specifically provides functions such as ExcludeLeadingPathDelimiter.
- PowerShell’s Join-Path implements golang’s behavior.
- zig sides with golang as well
Where does this behavior come from?
When I heard of it, I thought that these are common POSIX semantics. I was wrong. These strings are not passed down some API, but the behavior is implemented by programming languages:
- rust has dedicated code to implement this behavior
- python as well
And I was not able to reproduce the behavior in libc/syscalls. First, I tried chdir:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/limits.h>

int main(int argc, char* argv[])
{
    // relative path with a doubled slash; the return value is deliberately not checked
    chdir("foobar//etc");
    char cwd[PATH_MAX];
    if (getcwd(cwd, sizeof(cwd)) != NULL) {
        printf("Current working dir: %s\n", cwd);
    } else {
        return 1;
    }
    return 0;
}
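It prints Current working dir: /tmp when it is run inside /tmp. It seems to reject the provided filepath, or rather it fails to find it: POSIX pathname resolution treats consecutive slashes like a single slash, so foobar//etc is looked up as foobar/etc below the current directory and the part after the double slash is never interpreted as absolute. The same is true for fopen: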
fopen("main.c//tmp/main.go", "r")
… returns NULL. Maybe I need to use less libc and more POSIX:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char* argv[])
{
    // O_EXCL is only meaningful together with O_CREAT, so open read-only instead
    int fd = open("main.c//tmp/main.go", O_RDONLY);
    printf("%d\n", fd);
    return 0;
}
… but this plain open example also prints -1, indicating an error. Finally, realpath also returns NULL:
#include <stdio.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    char result[PATH_MAX];
    char *ret = realpath("main.c//tmp/main.go", result);
    // %p expects a void pointer
    printf("%p\n", (void *)ret);
    return 0;
}
In the end, POSIX specifies how to traverse/resolve a filepath, but there is no functionality to join them. Thus, there was likely no necessity to specify this behavior.
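These observations can be cross-checked from python as well. The following is a minimal sketch, assuming a POSIX-like system on which /etc exists; probe and the directory names are hypothetical. It checks that a doubled slash by itself does not break a lookup, and that the part after the doubled slash is not looked up as an absolute path.

```python
import os
import tempfile


def probe() -> dict:
    """Probe how the OS resolves relative names containing '//'."""
    results = {}
    with tempfile.TemporaryDirectory() as base:
        os.makedirs(os.path.join(base, "foobar", "only_here"))
        # The doubled slash is resolved like a single one.
        results["double_slash_ok"] = os.path.isdir(base + "/foobar//only_here")
        # foobar/etc does not exist; /etc does (assumption). If '//etc' were
        # interpreted as absolute, this lookup would succeed.
        results["treated_as_absolute"] = os.path.isdir(base + "/foobar//etc")
    return results


print(probe())
```

On such a system, the first lookup should succeed and the second one should fail, which matches the C experiments: the calls fail because the target does not exist, not because the kernel jumps to the root.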
If POSIX is not responsible and programming languages like python and rust implement it themselves, when did it start? When did python implement the behavior first?
- python 0.9.1 did not yet have a path.join function.
- python 1.2 provides path.join (in posixpath.py) with the following implementation mentioning the behavior explicitly:
  # Join two pathnames.
  # Ignore the first part if the second part is absolute.
  # Insert a '/' unless the first part is empty or already ends in '/'.
  def join(a, b):
      if b[:1] == '/': return b
      if a == '' or a[-1:] == '/': return a + b
      # Note: join('x', '') returns 'x/'; is this what we want?
      return a + '/' + b
- python 1.5.2 accepts a variadic number of arguments and continues implementing this behavior:
  # Join pathnames.
  # Ignore the previous parts if a part is absolute.
  # Insert a '/' unless the first part is empty or already ends in '/'.
  def join(a, *p):
      """Join two or more pathname components, inserting '/' as needed"""
      path = a
      for b in p:
          if b[:1] == '/':
              path = b
          elif path == '' or path[-1:] == '/':
              path = path + b
          else:
              path = path + '/' + b
      return path
Ok, so we know that python 1.2 already had this behavior. Does it explain why? No.
By the way, PEP 428 from 2012 introduced an object-oriented API for filesystem paths in python. Did they change the behavior?
from pathlib import Path
Path('foobar') / '/etc/passwd'
# gives "PosixPath('/etc/passwd')"
No, but simultaneously, a different behavior can easily be achieved:
from pathlib import Path
child = Path('/etc/passwd')
Path('foobar') / child.relative_to(child.anchor)
# gives "PosixPath('foobar/etc/passwd')"
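The workaround can be wrapped into a small helper. This is only a sketch: append_path is a hypothetical name, and it uses PurePosixPath so the result is the same on every platform. It does not protect against components such as .., so it is not a security measure on its own.

```python
from pathlib import PurePosixPath


def append_path(base: str, child: str) -> PurePosixPath:
    """Hypothetical helper: always append child below base (golang-style)."""
    c = PurePosixPath(child)
    if c.is_absolute():
        # Strip the leading "/" so that the child becomes relative.
        c = c.relative_to(c.anchor)
    return PurePosixPath(base) / c


print(append_path("foobar", "/etc/passwd"))  # foobar/etc/passwd
print(append_path("foobar", "etc/passwd"))   # foobar/etc/passwd
```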
The question of expected behavior
What is the expected behavior in the end? In 2014 in rust issue 16507, a user writes:
I would expect that a.join(&b) would return /foo/bar, however it returns /bar. Given my experience w/ path joining in Ruby and Go, I would expect that join concats two paths and does some normalization to remove double slashes, etc…
The user got 22 thumbs-ups for this initial issue description. lillyball counterargues:
I agree with @aturon. The only sensible operation when joining an absolute path onto some other path is to get the absolute path back. Doing anything else is just weird, and only makes sense if you actually think of paths as strings, where "join" is "append, then normalize". I do not understand why Go’s path.Join behaves in this way, although they are actually taking strings as arguments.
The C++ community also seemed to be divided, because diverging arguments have been raised during the definition process of C++17. In the end python’s behavior was still implemented for POSIX systems. First, some arguments in favor of python’s behavior were raised (2014):
This means that, for example, "c:\x" / "d:\y" gives "c:\x\d:\y", and that "c:\x" / "\\server\share" gives "c:\x\\server\share". This is rarely, if ever, useful.
An alternative interpretation of p1 / p2 could be that it yields a path that is the approximation of what p2 would mean if interpreted in an environment in which p1 is the starting directory. Under this interpretation, "c:\x" / "d:\y" gives "d:\y", which is more likely to match what was intended.
Later the opposite behavior was suggested and formalized (2017):
US 77, CA 6; 27.10.8.4.3 [path.append: operator/ and other appends not useful if arg has root-name]
“Passing a path that includes a root path (name or directory) to path.operator/=() simply incorporates the root path into the middle of the result, changing its meaning drastically. LWG 2664 proposes disallowing a path with a root name (but leaves the root directory possibility untouched); US 77/CA 6 (via P0430R0) objects and suggests instead making the same case implementation-defined. (P0430R1 drops the matter in favor of this issue.)”
[…]
// On POSIX,
path("foo") / "";     // yields "foo/"
path("foo") / "/bar"; // yields "/bar"
// On Windows, backslashes replace slashes in the above yields
Let us summarize the statistical data from above:
behavior | count | example output |
---|---|---|
python’s | 6 | /etc/passwd |
golang’s | 6 | foobar/etc/passwd |
unclear | 1 (Java) | |
My personal opinion on this is the following:
- I think joining means taking equivalent elements and concatenating them to work together (cf. python’s str.join).
- The fundamental problem is that foo/bar and /foo/bar are not equivalent elements at all. A relative and an absolute filepath carry different semantics. A relative filepath refers to different elements depending on your current location, whereas an absolute filepath is fixed. In terms of type systems, one might want to model them as two different types (because different operations can be done).
- Joining a relative and an absolute path is a hazard, because the absolute path dictates “start here”. Since joining happens from left-to-right (in our LTR writing systems), python’s behavior makes sense and corresponds to the semantics of relative/absolute file paths.
- Never trust user input! If you actually allow users to specify file paths (request URLs in webservers, arguments in command line tools to fetch data from production systems), you need to verify which arguments are allowed and check that. Certainly the standard library should help you with it, but always read the corresponding documentation to match your expectations with reality.
- Apparently, CVEs appeared and people actually get it wrong. This is certainly an argument in favor of golang’s behavior. However, whereas I consider C:\Windows join D:\Media resulting in D:\Media surprising, I personally consider C:\Windows\Media arbitrary. In the end, I think an error is the only way to go.
- I think we simply get the idea of file paths wrong. We consider relative and absolute filepaths as equivalent even though they are not. Maybe shells just gave us the wrong idea that everything is just a string anyways.
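The “error is the only way to go” position can be sketched in a few lines. strict_join and its flavour parameter are hypothetical names, not part of any standard library; the sketch relies on the anchor attribute of pathlib's pure paths, which is non-empty whenever a path carries a root or a drive.

```python
from pathlib import PurePosixPath, PureWindowsPath


def strict_join(base, *parts, flavour=PurePosixPath):
    """Hypothetical join that refuses anchored components instead of guessing."""
    result = flavour(base)
    for part in parts:
        piece = flavour(part)
        if piece.anchor:  # a root and/or drive is present, e.g. "/" or "D:\"
            raise ValueError(f"refusing to join anchored path: {part!r}")
        result = result / piece
    return result


print(strict_join("foobar", "etc", "passwd"))  # foobar/etc/passwd
try:
    strict_join("foobar", "/etc/passwd")
except ValueError as exc:
    print("error:", exc)
```

With flavour=PureWindowsPath, the same check would also reject the C:\Windows join D:\Media case from above.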
BTW, werkzeug (and thus flask utilizing it) has built its own safe_join function.
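For illustration, a containment check in this spirit might look as follows. This is not werkzeug's implementation, just a minimal sketch with the hypothetical name is_within: it resolves both paths and verifies that the result still lies below the base directory, regardless of which join semantics the library implements.

```python
import os.path


def is_within(base: str, user_path: str) -> bool:
    """Hypothetical check: does base joined with user_path stay below base?"""
    base_real = os.path.realpath(base)
    target = os.path.realpath(os.path.join(base_real, user_path))
    try:
        return os.path.commonpath([base_real, target]) == base_real
    except ValueError:
        # Raised on Windows when both paths are on different drives.
        return False


print(is_within("/srv/data", "reports/a.txt"))  # True
print(is_within("/srv/data", "/etc/passwd"))    # False
print(is_within("/srv/data", "../secret"))      # False
```

The paths /srv/data, reports/a.txt and ../secret are made-up examples and do not need to exist for the check to work.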
Conclusion
My original intention for investigating this topic was to determine who came up with this behavior originally. I was convinced it comes from the POSIX world, but I could not find any supporting evidence. I was wrong. The idea seems to come from the early days of programming languages (before 1990) and the actual origin remains unclear to me.
Thus I designed the blog article around the question “which behavior is implemented?” and also “what is the expected behavior?”. And everything, except throwing an error, seems insane to me now.