Many developers these days praise the refactoring support in Microsoft Visual Studio and JetBrains IntelliJ IDEA, as well as Eclipse variants. Those who prefer other tools, such as VS Code, Sublime Text, or Vim, often lament the limited refactoring support in those tools.
What if we tried to roll our own refactoring support? What would be difficult about it and what (if anything) would be easy? Let’s try, and see where it leads us! Worst case, we’ll gain an appreciation for how much help refactoring tools are giving us. Best case…well, let’s not get ahead of ourselves.
Rename
One of the most basic refactorings is rename. But let’s not mistake “basic” for “simple”. There are some pretty interesting requirements for a rename function, such as:
- Select text with awareness of surrounding text. That is, if we’re changing “abcd” we might not want to match on “abcde” or “xabcd”.
- Handle case sensitivity appropriately. Do we mean to select “Abcd” as well as “abcd”? What about “ABCD”?
- Allow for “undo” and “redo” operations.
- Before making any changes, show a preview of what would happen if the refactoring were carried out.
- Verify the refactoring was done as expected.
- Control the scope of the rename operation. Whole project source tree? A subset? A group of directories and files that represent some logical construct in the given programming language, such as a Java package?
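The first requirement in that list is easy to try at the command line. Here is a minimal sketch, assuming GNU sed (the `\b` word-boundary escape is a GNU extension) and a made-up helper name, `replace_whole_word`. It is only meant to show the idea, not how a finished tool would do it.

```shell
#!/bin/bash
# Hypothetical helper: replace FROM with TO only where FROM stands alone as a word.
# Assumes GNU sed; \b is a GNU extension and may not work in other sed variants.
# FROM and TO are inserted as-is, so they must not contain "/" or regex metacharacters.
replace_whole_word() {
    local from="$1" to="$2"
    sed "s/\\b${from}\\b/${to}/g"
}

# "abcd" is replaced; "abcde" and "xabcd" are left alone.
printf 'abcd abcde xabcd\n' | replace_whole_word abcd wxyz   # prints wxyz abcde xabcd
```

A word boundary here means a transition between letters, digits, or underscores and anything else, which is close to, but not the same as, what each programming language considers an identifier.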
In addition to general requirements like those, our tool must also handle language-specific considerations. It may not be sufficient just to replace text blindly. Examples:
- If we change a Ruby class name from “MyClass” to “YourClass”, convention calls for changing the filename from “my_class.rb” to “your_class.rb” as well; Ruby itself doesn’t enforce this, but tools that locate classes by filename rely on it. The same applies in reverse: If we rename a file that contains a class, we should change the class name in the file, as well as all usages of the class name.
- If we change the name of a public class in Java, we must change the filename to match; and vice versa.
- If we “rename” a package declaration in a Java source file, we have to maintain the alignment between class names and directory structure. What if the user selects the package name and chooses “rename”? What should our tool do in that case? Is it a “rename” or a “move” refactoring?
- Renaming a public class in Kotlin or changing the package where the class resides doesn’t have the same implications as it does in Java. The user may or may not want to change the filename or move the source file to another location in the directory hierarchy. How should our tool handle this case? Should we prompt the user, and if so, what about the impact on their workflow when they must answer prompts during refactoring?
- For legacy COBOL in which source statements must fit between columns 8 and 72, we have to handle the case when the new text extends beyond column 72. We may also have to preserve any text present in columns 73-80. This may be a consideration when the intent is to upload the modified code to a mainframe where it will be incorporated with an existing system; we may not have the option to use modern free-form COBOL. Refactoring is often used heavily on legacy code, and the refactored code has to fit back into the system where it lives.
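To make one of those language-specific rules concrete, here is a rough sketch of the Ruby case: deriving the conventional filename from a class name. The function name `class_to_ruby_filename` is made up for illustration, and the conversion is deliberately simplified (it does not try to split runs of capital letters such as acronyms).

```shell
#!/bin/bash
# Hypothetical helper: derive a conventional Ruby filename from a class name.
# Simplified: inserts "_" between a lowercase letter or digit and an uppercase
# letter, lowercases the result, and appends ".rb".
class_to_ruby_filename() {
    local class_name="$1"
    printf '%s\n' "$class_name" \
        | sed -E 's/([a-z0-9])([A-Z])/\1_\2/g' \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/$/.rb/'
}

class_to_ruby_filename MyClass     # prints my_class.rb
class_to_ruby_filename YourClass   # prints your_class.rb
```

A rename tool could compare this derived name with the actual filename to decide whether a file rename should accompany the class rename.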
I propose we begin using simple tools to get a feel for the problem. Let’s start with Bash on Linux to do some proof-of-concept experimentation. Once we gain a sense of what kinds of processing we’ll need to support, we can implement the solutions in a language suitable for a plug-in or extension for an editor…or we can decide it isn’t worth the effort.
Rename in current directory
We can replace all occurrences of a given string with another string in multiple files using sed (stream editor). It comes with most variants of Linux and Unix. This could be a starting point for our rename function:
GNU sed:
sed -i 's/fromtext/totext/g' *.txt
FreeBSD, Mac OS X sed:
sed -i '' 's/fromtext/totext/g' *.txt
That will change all occurrences of the string “fromtext” to “totext” in all files in the current working directory whose names end with “.txt”.
That’s easy enough, but in many cases it will result in changes we don’t want. Safe refactoring is important. Let’s start building safety into our implementation as early as possible, knowing we won’t achieve perfection.
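One cheap safety measure is the preview mentioned in the requirements: show what would change before changing anything. The sketch below is one way to do that with sed and diff. The function name `preview_rename` is made up for illustration, and the from and to values are passed to sed unescaped, so this only works for plain text without slashes or regex metacharacters.

```shell
#!/bin/bash
# Hypothetical helper: show what a rename would change without touching any file.
# Usage: preview_rename FROM TO FILE...
preview_rename() {
    local from="$1" to="$2" file
    shift 2
    for file in "$@"; do
        # Send the substituted text to stdout and compare it with the file on disk.
        # diff exits with status 1 when the inputs differ, which is expected here.
        sed "s/${from}/${to}/g" "$file" | diff -u "$file" - || true
    done
}

# Example (assumes some .txt files exist in the current directory):
#   preview_rename fromtext totext *.txt
```

Because sed runs without -i, the files stay untouched; the unified diff is only printed.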
Crude undo functionality
A basic level of safety may be provided if we enable the user to recover the original text after they review the results of the rename refactoring. As we move to more-robust tools, we may find this challenging, but for the moment we can use sed’s backup feature to make a copy of the original files. Then an undo is a matter of copying the original files back over their modified copies.
The -i option on sed edits files in place; GNU sed also accepts the long form --in-place. If we want sed to back up the original files, we specify a filename suffix with -i (or, in GNU sed, --in-place=SUFFIX), like this:
GNU sed:
sed -i.bkp 's/fromtext/totext/g' *.txt
FreeBSD, Mac OS X sed:
sed -i '.bkp' 's/fromtext/totext/g' *.txt
In keeping with the Principle of Least Astonishment, we should provide users with undo and redo behavior that is usually expected. That means multiple levels of undo. Let’s say we define our backup suffix as a number, like this:
sed -i-1 's/fromtext/totext/g' *.txt
We can increment the number with each rename and decrement it with each undo, so the highest-numbered backup always belongs to the most recent change. Each undo consumes the newest backup as we work our way back. That’s generally the behavior most people would expect without asking or referring to documentation.
Before the rename refactoring:
file1.txt
file2.txt
After a rename refactoring:
file1.txt    <= modified
file1.txt-1  <= original
file2.txt    <= modified
file2.txt-1  <= original
After an undo operation:
file1.txt    <= original
file2.txt    <= original
Now we’ve reached the point that we need some sort of script or program. A one-liner isn’t going to do all those different things. Here’s a Bash script for Linux:
Script rename:
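#!/bin/bash
# rename: replace text in the files matching a glob, keeping one backup per file.

function showUsage {
  echo -e "\nUsage:\nrename fromValue toValue filenameGlob"
}

if [ $# -lt 3 ]; then
  showUsage
  exit 1
fi

echo "rename from $1 to $2 in files $3"
# $3 is left unquoted on purpose so the shell expands the glob here.
sed -i-1 "s/$1/$2/g" $3
exit 0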
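We could run it like this. The glob is quoted so that the script, rather than the calling shell, expands it; unquoted, only the first matching file would reach sed as $3:
rename fromtext totext "*.txt"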
Now we have backups of all the .txt files, with names that end in .txt-1. Our undo function can be a mv command in a POSIX for loop. We can use Bash parameter expansion to remove the backup suffix as we move the file.
for name in *-1; do mv -v -- "${name}" "${name%-1}"; done
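If parameter expansion is unfamiliar, the following sketch shows the two forms that matter for this exercise. The helper names `original_name` and `backup_number` are made up for illustration; nothing here touches any files.

```shell
#!/bin/bash
# Illustrative helpers; the names are hypothetical and no files are touched.

# ${1%-*} removes the shortest trailing match of "-*", i.e. the backup suffix.
original_name() {
    printf '%s\n' "${1%-*}"
}

# ${1##*-} removes the longest leading match of "*-", leaving the suffix itself.
backup_number() {
    printf '%s\n' "${1##*-}"
}

original_name "file1.txt-1"       # prints file1.txt
backup_number "file1.txt-1"       # prints 1
original_name "my-notes.txt-12"   # prints my-notes.txt
```

The one-liner above uses the more specific `${name%-1}`, which strips exactly "-1" and nothing else.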
Script undo:
#!/bin/bash
# undo: restore the .txt files saved by the last rename.

function showUsage {
  echo -e "\nUsage:\nundo rename"
}

if [ $# -lt 1 ]; then
  showUsage
  exit 1
fi

echo "undo the last rename"
for name in *.txt-1; do
  mv -v -- "${name}" "${name%-1}"
done
exit 0
Now if we run `rename alpha beta "*Test.java"` we’ll see files like `FirstTest.java` and `SecondTest.java` containing the modified code along with `FirstTest.java-1` and `SecondTest.java-1` containing the original code. (This version of undo only looks for .txt backups, so its loop pattern would need to change to match.)
Then if we run `undo rename` we’ll see the files `FirstTest.java` and `SecondTest.java` containing the original code.
But this isn’t sufficient. We can only support one level of undo. Handling more levels in a sensible way takes more logic than a one-liner, so let’s extend the Bash scripts we just wrote.
Multi-level undo support
Let’s add logic to increment the suffix whenever rename is called, and decrement it whenever undo is called. (Of course, this naming scheme will be a problem if there are “real” files ending in dash and a number. We’d need the ability to define a different suffix and ensure our algorithm handles the logical “increment” and “decrement” operations properly. Let’s defer that until later.)
On Linux we can do something like this to list all the files whose names contain a dash and end with a digit from 1 to 9. This glob pattern isn’t exactly right: it will match names that have letters and numbers after the dash, too, and it will miss suffixes that end in 0, such as “-10”. But we don’t want to get stuck on that just now.
ls *-*[1-9]
That will list all filenames in the current directory (yes, I know; we’ll get to that) whose names end with a number preceded by a dash (and possibly preceded by other things, too). Knowing ls sorts its output in ascending order, we can pipe it to tail and specify -1 to mean “last line in the list”.
ls *-*[1-9] | tail -1
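That will give us the filename that sorts last, which is the one with the highest number at the end as long as the backups all belong to the same file and the numbers stay below 10. So if we have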
YourTest.java
YourTest.java-1
YourTest.java-2
YourTest.java-3
the ls and tail commands will result in
YourTest.java-3
For purposes of incrementing the backup filename suffix, we’re only interested in the number at the end of the filename.
For the rename operation, we want to find the highest-numbered backup file and increment that number by 1 to create the filename of the next backup file. That is, we want to pass “-4” to sed when we do the next rename, resulting in “YourTest.java-4”.
Let’s add logic to our rename script to do that.
#!/bin/bash
# rename: replace text in matching files, saving each original under a numbered suffix.

function showUsage {
  echo -e "\nUsage:\nrename fromValue toValue filenameGlob"
}

if [ $# -lt 3 ]; then
  showUsage
  exit 1
fi

FROM_VALUE="$1"
TO_VALUE="$2"
FILENAME_GLOB=$3

# Silence stderr while ls runs: "no such file" only means there are no backups yet.
exec 3>&2
exec 2> /dev/null
LAST_BACKUP="$(ls ${FILENAME_GLOB}*-*[1-9] | tail -1)"
exec 2>&3

if [ -z "$LAST_BACKUP" ]; then
  SUFFIX=0
else
  # Split the filename on dashes; the last piece is the backup number.
  OLDIFS=$IFS
  IFS='-' read -ra NAME_PARTS <<< "$LAST_BACKUP"
  IFS=$OLDIFS
  SUFFIX="${NAME_PARTS[-1]}"
fi

SUFFIX=$((SUFFIX + 1))

for filename in $FILENAME_GLOB; do
  sed -i-${SUFFIX} "s/${FROM_VALUE}/${TO_VALUE}/g" "$filename"
done
exit 0
That’s enough to prove the concept that we can save the current version of the files with a suffix that’s incremented with each rename operation. It certainly isn’t an ideal solution. The glob pattern on the ls command will match filenames that aren’t what we’re looking for, and because ls sorts names as text, the last name in the list isn’t always the one with the highest number. The logic itself will change text in ways we don’t intend, depending on what comes before or after the matching text in the files. The way it’s written, we have to enclose the filename glob in quotes when we call the script, or it will only process one file. But it’s okay for now. If we decide this approach is acceptable, we’ll rewrite it in a proper language.
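The quoting requirement is easy to see in isolation. In this small sketch, `third_argument` is a made-up stand-in for the rename script that simply reports what it receives as $3; the throwaway directory and file names are also just for the demonstration.

```shell
#!/bin/bash
# Hypothetical stand-in for the rename script: report what arrives as $3.
third_argument() {
    echo "$3"
}

# Build a throwaway directory holding two files that match *.txt.
demo_dir="$(mktemp -d)"
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

(
    cd "$demo_dir" || exit 1
    # Unquoted: the calling shell expands the glob, so $3 is only the first match.
    third_argument from to *.txt       # prints a.txt
    # Quoted: the pattern arrives intact and the script can expand it itself.
    third_argument from to "*.txt"     # prints *.txt
)

rm -rf "$demo_dir"
```

An alternative design would be to accept any number of file arguments after the first two and let the calling shell do the expansion, but that changes how the backup pattern is built, so we leave it alone here.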
Some of the code amounts to Bash housekeeping, and isn’t especially interesting for our learning experiment. We redirect stderr to null before executing the ls command, because the command will emit an error message if no matching file is found. In our situation, no matching file only means there are no backups from previous rename calls. It isn’t really an error.
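Both the stderr juggling and the loose pattern can be sidestepped by letting Bash expand the glob in a loop and comparing the suffixes as numbers. What follows is only a sketch under that assumption; `next_backup_suffix` is a name made up for illustration and is not part of the scripts in this article.

```shell
#!/bin/bash
# Hypothetical helper: print the suffix to use for the next backup in DIR.
# Suffixes are compared as numbers, so "-10" ranks above "-9" and the result
# does not depend on how the filenames themselves sort.
next_backup_suffix() {
    local dir="${1:-.}" name suffix highest=0
    for name in "$dir"/*-*; do
        [ -e "$name" ] || continue                  # the glob matched nothing
        suffix="${name##*-}"                        # text after the last dash
        [[ "$suffix" =~ ^[0-9]+$ ]] || continue     # skip non-numeric endings
        # 10# forces base 10 so a suffix like "08" is not read as octal.
        if (( 10#$suffix > highest )); then
            highest=$(( 10#$suffix ))
        fi
    done
    echo $(( highest + 1 ))
}

# Example: in a directory holding YourTest.java-1 through YourTest.java-3,
#   next_backup_suffix .     prints 4
```

There is no ls call, so there is no error message to suppress when no backups exist; the function just prints 1.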
Now we’re in a position to effect an undo operation. The most recent backup copies are the ones whose filenames have the highest numerical suffix. So, undo amounts to moving each of those backups over its target file, which overwrites the target and restores the original filename in one step.
That automatically removes the latest backup file from the pile, so we can use the same process to undo the refactorings step by step. If we decide to support redo functionality, we will want to keep those backups and track the current undo level some other way besides the filenames. Let’s defer that for now, and build this up one small step at a time. It’s quite possible we’ll see that redo isn’t necessary, after all.
Why? If we invest a lot of time to try and provide “hooks” for a future redo function while we’re building the undo function, we’ll take longer to provide our customers with the undo function, and we’ll probably end up changing our approach to redo by the time we’re really ready to start on it. So it’s best all around to leave it alone for now.
But…this is only a learning exercise. We don’t have customers! True, but we do have habits, and habits grow stronger with repetition. So let’s repeat the habit of not getting ahead of ourselves.
The current version of our undo script, way up there somewhere in this mound of text, assumes there’s only one set of backup files and their names end with “-1”. We need to add the same logic to that script as we put in the rename script to discover the highest-numbered backup file suffix. We’ll do it the time-honored way: Copy and paste. But it isn’t identical; for undo, we don’t want to increment the suffix.
Now a minor change to the mv command inside the for loop to replace the hard-coded “1” with the value of the SUFFIX variable. That leaves us with the following, which appears to do what we want and also does no harm when there are no numbered backup files in the directory.
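#!/bin/bash
# undo: restore the files saved by the most recent rename.

function showUsage {
  echo -e "\nUsage:\nundo rename filenameGlob"
}

if [ $# -lt 1 ]; then
  showUsage
  exit 1
fi

FILENAME_GLOB=$2

# Silence stderr while ls runs: "no such file" only means there are no backups.
exec 3>&2
exec 2> /dev/null
LAST_BACKUP="$(ls ${FILENAME_GLOB}*-*[1-9] | tail -1)"
exec 2>&3

if [ -z "$LAST_BACKUP" ]; then
  SUFFIX=0
else
  # Split the filename on dashes; the last piece is the backup number.
  OLDIFS=$IFS
  IFS='-' read -ra NAME_PARTS <<< "$LAST_BACKUP"
  IFS=$OLDIFS
  SUFFIX="${NAME_PARTS[-1]}"
fi

# Move each newest backup over its target; stderr is silenced in case none exist.
exec 3>&2
exec 2> /dev/null
for name in *-${SUFFIX}; do
  mv -f -- "${name}" "${name%-${SUFFIX}}"
done
exec 2>&3
exit 0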
Let’s do some cleaning up before we move on. The logic to determine the latest backup file number is the same in both scripts, except the rename script increments the number. Also, an earlier draft set the SUFFIX to 1 by default (when there are no existing backups), which is correct for rename but not for undo. It wasn’t causing a problem, but it could possibly have caused problems as we continued to modify the scripts.
So the listings above set the SUFFIX to 0 by default instead, and rename increments it before its loop. Now let’s do some quick and dirty testing, as we haven’t been test-driving the code. [Tick, tock.] Seems okay for our immediate purposes.
I called that “cleaning up” rather than “refactoring” because it changes behavior – initializing the SUFFIX to 0 instead of 1 is potentially a behavior-changing modification. Now we’re in a position to refactor.
We have some duplication in the two scripts. The logic to determine the latest backup file suffix is identical in both scripts. Let’s pull it out into a separate function and start a file to contain common functions, called common_functions, for lack of a better name.
# Sets SUFFIX to the number of the latest backup, or 0 when there are none.
# Expects FILENAME_GLOB to be set by the calling script.
function find_latest_backup_suffix {
  exec 3>&2
  exec 2> /dev/null
  LAST_BACKUP="$(ls ${FILENAME_GLOB}*-*[1-9] | tail -1)"
  exec 2>&3

  if [ -z "$LAST_BACKUP" ]; then
    SUFFIX=0
  else
    OLDIFS=$IFS
    IFS='-' read -ra NAME_PARTS <<< "$LAST_BACKUP"
    IFS=$OLDIFS
    SUFFIX="${NAME_PARTS[-1]}"
  fi
}
Now we can source that file in both the rename and undo scripts. Here’s rename:
#!/bin/bash
. common_functions

function showUsage {
  echo -e "\nUsage:\nrename fromValue toValue filenameGlob"
}

if [ $# -lt 3 ]; then
  showUsage
  exit 1
fi

FROM_VALUE="$1"
TO_VALUE="$2"
FILENAME_GLOB=$3

find_latest_backup_suffix
SUFFIX=$((SUFFIX + 1))

for filename in $FILENAME_GLOB; do
  sed -i-${SUFFIX} "s/${FROM_VALUE}/${TO_VALUE}/g" "$filename"
done
exit 0
And here’s undo:
#!/bin/bash
. common_functions

function showUsage {
  echo -e "\nUsage:\nundo rename filenameGlob"
}

if [ $# -lt 1 ]; then
  showUsage
  exit 1
fi

FILENAME_GLOB=$2

find_latest_backup_suffix

# Move each newest backup over its target; stderr is silenced in case none exist.
exec 3>&2
exec 2> /dev/null
for name in *-${SUFFIX}; do
  mv -f -- "${name}" "${name%-${SUFFIX}}"
done
exec 2>&3
exit 0
That’s better, but there are still some glaring issues here. For one, the solution only works on files located in the directory where we execute the command. It would be preferable for it to start at whatever directory we choose, and to process directories recursively.
Another problem is the script clutters the current directory with backup files. Let’s tackle that problem now.
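On the first of those issues, recursion, find can do the directory walking for us. The sketch below lists the files under a starting directory that match a name pattern and contain the text we want to rename. The function name `files_containing` and its argument order are made up for illustration; the text is treated as a fixed string rather than a regular expression.

```shell
#!/bin/bash
# Hypothetical helper: list files under START_DIR whose names match NAME_PATTERN
# and whose contents include TEXT (treated as a fixed string, not a regex).
# Usage: files_containing START_DIR NAME_PATTERN TEXT
files_containing() {
    local start_dir="$1" name_pattern="$2" text="$3"
    find "$start_dir" -type f -name "$name_pattern" \
        -exec grep -l -F -- "$text" {} + | sort
}

# Example:
#   files_containing src "*.java" cuatro
```

A recursive rename could loop over that list instead of over a glob expanded in the current directory. Filenames containing newlines would confuse the line-oriented output, which is acceptable for an experiment but not for a real tool.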
Moving work files to a temp directory
Most development tools have one or more directories they use to store temporary files and configuration files that aren’t part of a user’s project. We should store our temporary files outside the project source tree, too.
This is just a learning exercise, so it doesn’t make much difference what we call the temp directory. Why don’t we just call it temp?
We’ll define an environment variable, TEMPDIR, where we can put the name of the temporary directory. We can create a file, env, where we can put our environment variable settings. As common_functions gets pulled into the other scripts, we can source env in common_functions and the other scripts won’t have to know about it.
Now we need to modify the logic in rename and undo to use the temporary directory. I’m not proficient enough with sed to know if there’s a better way, but I think we can move the backup file after running the sed command. Poking around with git in a playground directory, it seemed that if we create a file and immediately remove it, it doesn’t show up as “untracked” or interfere with git in any way.
Why do we care? Because when the user commits changes, we can remove the temporary backup files. There’s no need to undo changes in this manner, as they can revert to an earlier commit if necessary. Our temporary files only need to exist up to the point the user commits. Our temporary directory is not supposed to be included in version control anyway, but we’re creating temporary files directly in the source directory and then moving them to the temporary directory, so there’s a possibility git might think we intended to track them. It looks as if that won’t be a problem.
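If creating the backup inside the source directory at all is a concern, one alternative is to copy the original into the temporary directory first and then rewrite the source file from that copy. This is only a sketch of that idea, not what the scripts in this article do; the function name `rename_with_backup` and its argument order are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: copy FILE to BACKUP first, then rewrite FILE from that copy.
# No backup file is ever created next to the source file.
# Usage: rename_with_backup FROM TO FILE BACKUP
rename_with_backup() {
    local from="$1" to="$2" file="$3" backup="$4"
    mkdir -p "$(dirname "$backup")" || return 1
    cp -p -- "$file" "$backup" || return 1
    # Read the untouched copy and overwrite the original's contents.
    sed "s/${from}/${to}/g" "$backup" > "$file"
}

# Example:
#   rename_with_backup rain snow text.txt ./temp/text.txt-1
```

Since sed runs without -i, this form also avoids the difference between the GNU and BSD spellings of the in-place option.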
This involves changes in common_functions, rename, and undo to write the backup files into the new TEMPDIR directory and restore them from there for undo.
The code that we have so far doesn’t know where to copy the backup files for an undo operation. The replacement files are left in the TEMPDIR directory, and not in the directory where they came from. To fix that, we’ll introduce another environment variable, PROJECT_ROOT. The undo script will prepend the value of PROJECT_ROOT to the destination filename for the mv.
For now, we’re working with files that are all in the same directory. We’ll call that PROJECT_ROOT. Once we have that working, we’ll see about handling a more realistic directory tree.
File env:
# Paths are relative to PROJECT_ROOT
PROJECT_ROOT=/home/neopragma/text-sandbox
TEMPDIR=./temp
File common_functions:
. env

# Sets SUFFIX to the number of the latest backup in TEMPDIR, or 0 when there are none.
# Expects FILENAME_GLOB to be set by the calling script.
function find_latest_backup_suffix {
  pushd "$TEMPDIR"
  exec 3>&2
  exec 2> /dev/null
  LAST_BACKUP="$(ls ${FILENAME_GLOB}*-*[1-9] | tail -1)"
  exec 2>&3

  if [ -z "$LAST_BACKUP" ]; then
    SUFFIX=0
  else
    OLDIFS=$IFS
    IFS='-' read -ra NAME_PARTS <<< "$LAST_BACKUP"
    IFS=$OLDIFS
    SUFFIX="${NAME_PARTS[-1]}"
  fi
  popd
}
File rename:
#!/bin/bash
. common_functions

function showUsage {
  echo -e "\nUsage:\nrename fromValue toValue filenameGlob"
}

if [ $# -lt 3 ]; then
  showUsage
  exit 1
fi

FROM_VALUE="$1"
TO_VALUE="$2"
FILENAME_GLOB=$3

find_latest_backup_suffix
SUFFIX=$((SUFFIX + 1))

for filename in $FILENAME_GLOB; do
  sed -i-${SUFFIX} "s/${FROM_VALUE}/${TO_VALUE}/g" "$filename"
  # Move the backup sed just created out of the source directory.
  mv "${filename}-${SUFFIX}" "$TEMPDIR/${filename}-${SUFFIX}"
done
exit 0
File undo:
#!/bin/bash
. common_functions

function showUsage {
  echo -e "\nUsage:\nundo rename filenameGlob"
}

if [ $# -lt 1 ]; then
  showUsage
  exit 1
fi

FILENAME_GLOB=$2

find_latest_backup_suffix

exec 3>&2
exec 2> /dev/null
for name in ${TEMPDIR}/*-${SUFFIX}; do
  DESTINATION="${name%-${SUFFIX}}"               # remove backup suffix
  DESTINATION="${DESTINATION##${TEMPDIR}/}"      # remove TEMPDIR
  DESTINATION="${PROJECT_ROOT}/${DESTINATION}"   # prepend PROJECT_ROOT
  mv -f -- "${name}" "$DESTINATION"              # move the file
done
exec 2>&3
exit 0
Running these scripts against a couple of simple text files seems to work as expected. Obviously, this is not a comprehensive test. Here are the text files:
File text.txt:
The rain in Spain falls mainly in the plain. Rain, rain, go away.
File text2.txt:
The rain in Germany falls most orderly. Rain, rain, go away.
After running the rename script like this:
rename rain snow "*.txt"
The text files are changed as follows:
File text.txt:
The snow in Spain falls mainly in the plain. Rain, snow, go away.
File text2.txt:
The snow in Germany falls most orderly. Rain, snow, go away.
In the ./temp directory, we find the original files with the names text.txt-1 and text2.txt-1.
Running the undo script like this:
undo rename "*.txt"
Restores the text.txt and text2.txt files to the way they were originally and removes the backups from the ./temp directory.
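Notice that in the renamed files the capitalized “Rain” was left alone. That is the case-sensitivity requirement from the start of the article showing up in practice. The sketch below contrasts the two behaviors on one line of the sample text; it assumes GNU sed, where the I flag on the s command makes the match case-insensitive.

```shell
#!/bin/bash
# Assumes GNU sed: the I flag on the s command is a GNU extension.
line='Rain, rain, go away.'

# Case-sensitive (the behavior of the scripts so far): only "rain" changes.
printf '%s\n' "$line" | sed 's/rain/snow/g'     # prints Rain, snow, go away.

# Case-insensitive: both words change, but the capital letter is lost.
printf '%s\n' "$line" | sed 's/rain/snow/gI'    # prints snow, snow, go away.
```

Neither result is obviously “right”. A real rename tool would have to decide, or ask, whether differently-cased matches count and whether to preserve their capitalization.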
Propagating refactorings through the source tree
In principle, our approach should work provided we specify the complete pathname of each file. Of course, in “real life” we wouldn’t ask the user to type a long command line just to do a “rename”. They would select something in an editor and choose a “refactor” option in some manner, via keystrokes or pointer gestures. Under the covers, the refactoring tool would supply the absolute path of affected files to the rename function. No doubt many other implementations are possible, too.
For our immediate purposes, we can continue to use the command line. What happens if we specify a file that isn’t in the project root directory? Let’s make up a fake directory structure and stick a file in it. Here’s a sort of Java-esque source tree:
src
|
+-- main
|   |
|   +-- java
|   |   |
|   |   +-- com
|   |       |
|   |       +-- nothing
|   |           |
|   |           +-- something
|   |               |
|   |               +-- ClassOne.java
|   |
|   +-- resources
|
+-- test
    |
    +-- java
    |   |
    |   +-- com
    |       |
    |       +-- nothing
    |           |
    |           +-- something
    |               |
    |               +-- ClassOneTest.java
    |
    +-- resources
The file ClassOne.java has some junk code in it. When we try this:
rename cuatro latePayment "src/main/java/com/nothing/something/*.java"
we get an error:
mv: cannot move 'src/main/java/com/nothing/something/ClassOne.java-1' to './temp/src/main/java/com/nothing/something/ClassOne.java-1': No such file or directory
The intent is that we replicate the source tree under our temp directory, to the extent necessary to store temporary backup files in the right relative locations. But the tool has to create the directories under TEMPDIR. In this case, we have to create the directory “$TEMPDIR/src/main/java/com/nothing/something” before we move the backup file to TEMPDIR. We’ll also have to remove those directories whenever we clean out TEMPDIR.
The rename function has to identify the path to each file and create the same directory hierarchy under TEMPDIR. We can use `mkdir -p` to accomplish this safely. Let’s add some logic to rename to do it.
The for loop in the rename script now looks like this:
for filename in $FILENAME_GLOB; do
  sed -i-${SUFFIX} "s/${FROM_VALUE}/${TO_VALUE}/g" "$filename"
  TEMPFILE="${filename}-${SUFFIX}"
  # Recreate the file's directory path under TEMPDIR before moving the backup.
  PATH_FROM_PROJECT_ROOT=$(dirname "$TEMPFILE")
  mkdir -p "${TEMPDIR}/${PATH_FROM_PROJECT_ROOT}"
  mv "$TEMPFILE" "${TEMPDIR}/${TEMPFILE}"
done
Now when we run rename the temporary backup files appear to be saved where we expect them to be.
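The cleanup side mentioned earlier, removing the replicated directories again, can be handled by find. This is a sketch, assuming a find that supports `-mindepth`, `-empty` and `-delete` (GNU and BSD find do; they are not part of POSIX). The function name `prune_empty_dirs` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: remove empty directories left under TEMP_DIR,
# keeping TEMP_DIR itself. -delete processes children before parents,
# so a chain of nested empty directories is removed in one pass.
prune_empty_dirs() {
    local temp_dir="$1"
    [ -d "$temp_dir" ] || return 0
    find "$temp_dir" -mindepth 1 -type d -empty -delete
}

# Example: after an undo has moved the last backup out of ./temp/src/...,
#   prune_empty_dirs ./temp
```

Directories that still hold backups from earlier renames are left in place, because they are not empty.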
We’ve introduced a new problem, however. What happens when we run the following rename refactorings?
rename rain snow "*.txt"
rename cuatro latePayment "src/main/java/com/nothing/something/*.java"
The changes are correct, but the temporary backup files appear as:
$TEMPDIR/text.txt-1
$TEMPDIR/text2.txt-1
$TEMPDIR/src/main/java/com/nothing/something/ClassOne.java-1
We haven’t changed the undo script. Let’s see what happens when we run it as-is. We’ve done two rename refactorings, and our intent is to reverse the changes to ClassOne.java, but not the changes to the text files in the project root directory. All the backup files end in “-1”.
undo rename "src/main/java/com/nothing/something/*.java"
Both rename refactorings were undone, when the intent was only to undo the latest one. The problem lies with the find_latest_backup_suffix function. To set the LAST_BACKUP variable, the function should look everywhere in TEMPDIR to find the latest backup filename suffix, and not just at the filenames that match the glob.
Let’s make that change. First, in common_functions we want the find_latest_backup_suffix function to look everywhere under TEMPDIR and pluck out the highest suffix number. All the files with that number are involved in the most recent “rename” refactoring, regardless of their location in the source tree. For Bash, we need to set “globstar” and adjust the pattern for finding backup files:
. env

# Sets SUFFIX to the highest backup number found anywhere under TEMPDIR, or 0.
function find_latest_backup_suffix {
  pushd "$TEMPDIR"
  exec 3>&2
  exec 2> /dev/null
  shopt -s globstar                              # <= change
  LAST_BACKUP="$(ls **/*-*[1-9] | tail -1)"      # <= change
  exec 2>&3

  if [ -z "$LAST_BACKUP" ]; then
    SUFFIX=0
  else
    OLDIFS=$IFS
    IFS='-' read -ra NAME_PARTS <<< "$LAST_BACKUP"
    IFS=$OLDIFS
    SUFFIX="${NAME_PARTS[-1]}"
  fi
  popd
}
We need to make a complementary change in the undo script so that we pick up all the backup files related to the most recent “rename”:
#!/bin/bash
. common_functions

function showUsage {
  echo -e "\nUsage:\nundo rename"
}

if [ $# -lt 1 ]; then
  showUsage
  exit 1
fi

find_latest_backup_suffix

exec 3>&2
exec 2> /dev/null
shopt -s globstar                                # <= change
for name in ${TEMPDIR}/**/*-${SUFFIX}; do        # <= change
  DESTINATION="${name%-${SUFFIX}}"               # remove backup suffix
  DESTINATION="${DESTINATION##${TEMPDIR}/}"      # remove TEMPDIR
  DESTINATION="${PROJECT_ROOT}/${DESTINATION}"   # prepend PROJECT_ROOT
  mv -f -- "${name}" "$DESTINATION"
done
exec 2>&3
exit 0
We don’t need the same filename glob for undo as we do for rename. All we care about are the filename suffixes of the backup files. So we can simplify undo.
undo rename
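Because the suffix now has to be found across the whole tree under TEMPDIR, a text sort of the paths is even less reliable than before: which name sorts last depends on the directory names, not the numbers. One alternative is to pull out just the numbers and sort those. This is a sketch of that idea using find, sed and sort; the function name `latest_backup_suffix` is made up for illustration and it is not the code in the repository.

```shell
#!/bin/bash
# Hypothetical helper: print the highest numeric backup suffix found anywhere
# under DIR, or 0 when there are no backups.
latest_backup_suffix() {
    local dir="$1" latest
    latest="$(find "$dir" -type f -name '*-[0-9]*' 2>/dev/null \
        | sed -n 's/.*-\([0-9][0-9]*\)$/\1/p' \
        | sort -n \
        | tail -n 1)"
    echo "${latest:-0}"
}

# Example:
#   SUFFIX="$(latest_backup_suffix "$TEMPDIR")"
```

The sed expression keeps only paths that end in a dash followed by digits, so a stray file such as notes-2b is ignored, and `sort -n` ranks 10 above 9.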
Checkpoint
We’ve done a fair amount of work – at least, it feels that way to me! – and we’ve barely scratched the surface of just one refactoring. If you look back at the list of “requirements” at the beginning of this article, you’ll see we’ve only covered a little bit of what a “rename” feature has to do.
In fact, of the six bullet points under general requirements, we’ve completed exactly: Zero. Of the language-specific considerations, we’ve covered exactly: Zero. That’s progress!
Actually, it’s pretty good progress, all things considered. We have a more-or-less functional rename script and an undo script that can reverse the changes. There are several directions we could take from here.
- Continue refining the rename script. From this, we could learn whether our approach is sound, or if there are “gotchas” lying in wait ahead of us on this path.
- Develop a crude implementation of another refactoring – extract constant, or whatever. From this, we could learn whether some of the logic we’ve built so far can be reused for other refactoring operations. (I suspect there is some generally-usable stuff in there.) We could also discover any hidden “gotchas” with our approach to undo functionality; it’s possible our implementation only works for rename.
- Try to implement our solution as a plugin or extension to an editor. From this, we could learn whether our approach is transferable to “real” code, or if we’re off in the weeds.
There may be even more options than those three. If we were working as a team, we’d have a discussion and decide what we’d like to do next. But we’re not working as a team. So, I guess I’ll decide what to do when I start on the next post in this series.
Here’s a GitHub repository for the experimental code: reefer-madness.