Extracting: Railroad Tycoon

Hier könnt ihr euch selbst, eure Homepage, euren Entwicklerstammtisch, Termine oder eure Projekte vorstellen.
Forumsregeln
Bitte Präfixe benutzen. Das Präfix "[Projekt]" bewirkt die Aufnahme von Bildern aus den Beiträgen des Themenerstellers in den Showroom. Alle Bilder aus dem Thema Showroom erscheinen ebenfalls im Showroom auf der Frontpage. Es werden nur Bilder berücksichtigt, die entweder mit dem attachement- oder dem img-BBCode im Beitrag angezeigt werden.

Die Bildersammelfunktion muss manuell ausgeführt werden, die URL dazu und weitere Details zum Showroom sind hier zu finden.

This forum is primarily intended for German-language video game developers. Please don't post promotional information targeted at end users.
Antworten
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

tl:dr; Wall of text about a very old game with no current relevance. And about this Forum. And yet you are here ...

Introduction

Long story short, I have managed to convert some of the assset files that come with "Sid Meier's Railroad Tycoon Deluxe" into a modern file format (png) and have a pretty good idea of what the other files are about. In the end this is not the tale of a great feat I would have hoped it would become. My moderate success was not due to some sudden spark of genius but due to more sleepless nights than I care to admit in which I banged my head at the disassembly.

Over years I have revisited and left the problem again many times, Years with pauses of several months between shorts bursts of renewed interest, the latter often coinciding with updates in Krishty's "Extracting Ace Combat" Thread here.

I don't even know what I did wrong over all that time. One evening without consciously doing anything different than before, I "just knew" I was, finally, looking at the disassembly of the right function (Figure 1). I ported the assembly verbatim to C (think: uint16_t ax = 0x0;) , and I was done. Subsequent restructuring of the code in a more high-level style did not reveal any hidden secrets, at least not to me. Totally anticlimactic.

_______________________________________________________________________________________________________________________

Code: Alles auswählen

_071B:0010
_071B:0010 ; =============== S U B R O U T I N E =======================================
_071B:0010
_071B:0010 ; Attributes: bp-based frame
_071B:0010
_071B:0010 sub_56E0        proc far                ; CODE XREF: sub_1EE2+B2↑P
_071B:0010
_071B:0010 arg_0           = dword ptr  6
_071B:0010 arg_4           = dword ptr  0Ah
_071B:0010 arg_8           = word ptr  0Eh
_071B:0010
_071B:0010                 push    bp
_071B:0011                 mov     bp, sp
_071B:0013                 push    si
_071B:0014                 push    di
_071B:0015                 push    ds
_071B:0016                 push    es
_071B:0017                 push    bx
_071B:0018                 cld
_071B:0019                 mov     ax, [bp+arg_8]
_071B:001C                 cmp     ax, 1
_071B:001F                 ja      short loc_56F3
_071B:0021                 out     0A6h, al        ; Interrupt Controller #2, 8259A
_071B:0023
_071B:0023 loc_56F3:                               ; CODE XREF: sub_56E0+F↑j
_071B:0023                 les     di, [bp+arg_4]
_071B:0026                 lds     si, [bp+arg_0]
_071B:0029                 mov     dx, 10h
_071B:002C                 lodsw
_071B:002D                 mov     bp, ax
_071B:002F
_071B:002F loc_56FF:                               ; CODE XREF: sub_56E0+2C↓j
_071B:002F                                         ; sub_56E0+76↓j ...
_071B:002F                 shr     bp, 1
_071B:0031                 dec     dx
_071B:0032                 jnz     short loc_5709
_071B:0034                 lodsw
_071B:0035                 mov     bp, ax
_071B:0037                 mov     dl, 10h
_071B:0039
_071B:0039 loc_5709:                               ; CODE XREF: sub_56E0+22↑j
_071B:0039                 jnb     short loc_570E
_071B:003B                 movsb
_071B:003C                 jmp     short loc_56FF
_071B:003E ; ---------------------------------------------------------------------------
Figure 1: The Disassembly looks like this. You knew that, of course, because all disassembly somewhat looks like this. That exactly is the problem with finding anything meaningful in it.
_______________________________________________________________________________________________________________________

This post will probably make very little sense without the context of Krishty's thread as linked above. I suggest you read that thread first. It is the more interesting read anyway. I will reference that thread a lot, maybe to the point where it gets awkward. To avoid misunderstandings: Krishty and I never met. The number of PMs we exchanged over the years can be counted on one hand, I think. He does not know about this project until now and I do not know anything about him other that he did a cool thing and I wanted to do the same. I will go into my motivation some more now but part of me thinks, that is all hindsight rationalisation and basically this is about competition. "If Krishty can do it, I will be damned if I can't." Maybe I'm overthinking this.

Motivation

"Sid Meier's Railroad Tycoon Deluxe" (what a title! let's call it RRT) was not the first videogame I played but the first one I owned. The first physical disc someone bought in a shop and gifted to me. The first disk that was inserted into the newly bought family computer to install a game, or probably any software, with little to no concept among the gathered users what that was: installing software. It was not a new game when I got it, I think the first release was 1990 - without the "Deluxe" back then - and this anecdote takes place ca. 1996. Many now classic gaming franchises were founded in that year, you can google for yourself, but the 1996 release surely most relevant for this story is "Sid Meier's Civilisation II", arguably marking the passage of a whole generation in the genre. I was late to the party.

Ten more years later, 2kgames made RRT available for free as part of the marketing campaign for "Sid Meier's Railroads!" an otherwise unremarkable game that does not run on Windowses newer than Windows XP. They bundled RRT with a Dosbox and a .pdf of the copy protection sheet that was so usual in these days. In preparation for this post, I checked and the link is now dead. There is no longer any sign on the 2K Homepage that "Railroads!", let alone RRT ever happened. So 2K seems to have abandoned RRT almost 30 years after its first release but RRT never again left me after downloading that full digital copy.

You can still get RRT at https://www.pcwelt.de/downloads/Sid-Mei ... 73103.html apparently, which seems legit, but I haven't tried it and you would proceed at your own risk.

As it is probably clear from the above, nostalgia is a main driver of what follows. Krishty speaks of "nostalgische Verstimmung" (an untranslatable play of words along the lines of "nostalgic moodiness" and the desire for conservation (here I won't even try to translate his exact words).

For me it is less about conservation and more about evolution. The interest in games, kindled by the one we discuss here, led to an interest in game development - who would have guessed, given were we are here - led quickly to the realisation that I am unable to produce even passable "programmers art". There are many reasons why I never produced a single game, however small, but this is the one reason which I would have put forward at age 14 or so. An age at which admitting any other reason would have been too much for a developing ego. One day around that time I realised that one of my games (not RRT) stored its assets as bitmaps which I could read. Since that day I had this fantasy that I could use the assets of an existing game to make a new one. Like I said, this never happened, but the idea somehow sticks with me until this day.

The second motivator, which I apparently also share with Krishty, is the urge to do something comparatively hard with no conceivable utility other than my own entertaiment just because it is possible. The analogy of a jigsaw-puzzle comes too mind. Simply knowing that it had to be possible to read these files because, of course, the original game did just that, helped me over many dead ends.

What next?
To manage expectations: Given how long it took to write down only this part, I am not sure I will ever complete the write-up. Given that it took me years to get to the point were I had something to write about, what are a few month for the write-up? If I get around to it you might eventually learn how I obtained Figure 2. For now you can find Part II here.

_______________________________________________________________________________________________________________________

screen1.png
screen1.png (6.55 KiB) 3584 mal betrachtet
Figure 2: Teaser. Here is something you won't see in the game. A tile masked "Err", probably for debugging purposes.
_______________________________________________________________________________________________________________________
Benutzeravatar
Krishty
Establishment
Beiträge: 7858
Registriert: 26.02.2009, 11:18
Benutzertext: state is the enemy
Kontaktdaten:

Re: Extracting: Railroad Tycoon

Beitrag von Krishty »

Krishty Krishty Krishty … cool stuff about RRT … Krishty Krishty Krishty ;)

Great work and very lovely tiles you mined there! I’m keen on details – I wrote decoders for some 90s image codecs (PackBits & Co.) and always like to learn about new ones. I didn’t bother to decode the disassembly, but it looks hand-written – no modern compiler would emit such dense encoding, and I doubt compilers in the 90s were any better.
One day around that time I realised that one of my games (not RRT) stored its assets as bitmaps which I could read. Since that day I had this fantasy that I could use the assets of an existing game to make a new one.
Well written … this probably got me on ZFX :-)
seziert Ace Combat, Driver, und S.T.A.L.K.E.R.   —   rendert Sterne
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

Krishty hat geschrieben: 01.08.2019, 21:10 I didn’t bother to decode the disassembly, but it looks hand-written – no modern compiler would emit such dense encoding, and I doubt compilers in the 90s were any better.
Yes, I'm pretty sure the image decoding was directly written in ASM. The basic unit it operates on is a bit (not a byte) the way this works is to shift and then jump based on the carry flag. I am not aware of a way how this can be expressed in C directly. The algorithm is actually easier to express in ASM than it is in C I think.

In C, to my best knowledge I can do no better than this:

Code: Alles auswählen

cf = bp & 0x01;
bp = bp >> 1;
if (cf) 
where in ASM I have this:

Code: Alles auswählen

shr bp, 1
jb somewhere
I wasn't able to determine if compilers understand that these are equivalent but in any case the C version is a very indirect way of expressing this. Usually it is the other way round, I assume.
Benutzeravatar
Krishty
Establishment
Beiträge: 7858
Registriert: 26.02.2009, 11:18
Benutzertext: state is the enemy
Kontaktdaten:

Re: Extracting: Railroad Tycoon

Beitrag von Krishty »

Yeah, no way you can do this in C. I learned it the hard way while trying to implement integer overflow checks – carry/overflow is totally out of reach for compilers if you are not using the according intrinsics (_mm_addcarry_u32() & Co.). Must be hand-written. Very cool trick, though it probably fails micro-op fusion on modern CPUs.
seziert Ace Combat, Driver, und S.T.A.L.K.E.R.   —   rendert Sterne
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

tl:dr; Lengthy setting of the stage and revelling in nostalgia. Some minor technical points but mostly backstory.

Versions, Documentation, and other Quirks

In Part I, I abbreviated "Sid Meier's Railroad Tycoon Deluxe" to "RRT". This might cause some confusion to potential readers. Only after writing I learned, that among some fans "RRT" is reserved for the original game (without the "Deluxe") and the latter is commonly referred to as "RDX". But to me it always will be just "RRT" and this thread is only about the "Deluxe" version.

Sadly, I never played the original. I would love to, but chances of ever getting my hands on a copy are pretty small. Some consider the original version the better game and from what I read in retrospect some things that seemed to be improvements at the time might indeed have not aged as well as the original. In any case, when I first got the game I did not have this context and so to me it was the Railroad Tycoon, not a second version.

As a curious side-note I think the developers themselves were a bit conflicted whether the "Deluxe" version was actually an improvement. The manual, likely approved by some marketing department, has one version of the story to tell, but there is a readme1.doc (I wonder where the 1 comes from) with a slightly different phrasing. I think it is a fascinating document of its time and it probably doesn't hurt to cite it here in its entirety, ASCII-Art and all (Figure 3a). While we are at it, there is another, slightly more official looking readme file called readme.now. Among other things it contains an apology for using the wrong data types (Figure 3b).
______________________________________________________________________________________________________________________

Code: Alles auswählen

      ______             _   _  ___   __             __
 \/___|    |    | | |   | | | |  |   |_   /\  |\  | |__
 |     [][]|    |-| |   |-  |-   |   |   /__\ | \ |    |
 |         |    | | |   | \ | \  |   |  /    \|  \| ___|
/ oo--OOO-o


Release Notes:  Update for Railroad Tycoon Deluxe to version 1.01

Bug Fixes:

  1)    Save game files are now fixed.  The bug did not show up in
        our QA process.  For those of you who are versed in programming
        we forgot a parameter in a create call.  Since the create call
        was using a return address to fill in the blank it would only
        fail if your machine config was just right. Or wrong in this
        case. (I took time to explain this because someone out there
        was pretty vocal about how we could miss this.  Well now you
        know.)

  2)    Center Key now works 'C'.

  3)    Roland sound bug fix for 486 33MHz and above.


History:
   I thought some of you would like to know how we came to publish
Railroad Tycoon Deluxe.  Our Japanese development group released
Railroad Tycoon in November of 1991.  The game was a direct port to
the NEC 9801 from the IBM.  Our Japanese customers looked at the EGA
graphics and said "Uggh!" or the Japanese equivalent.  This motivated
us to rework the graphics system for a 640x400 16 color graphics
mode, (NEC Format.)

   In the process of reworking the graphics, we decided to add new
theaters and trains.  After we released the Japanese version of the
game we were asked if we could release this product for the US/UK. 
Because of the interest we had on the BBS's we decided to include it
in our Spring 1993 release.

We have concentrated on new worlds and economies and a sound and art
upgrade.  I have read your posts.  Please continue sending us your
comments and suggestions.  Understanding what you are looking for
will help us include appropriate suggestions in future products.

Thank You for purchasing
    Railroad Tycoon Deluxe


    Jeff Billings
Figure 3a: We listen to our fans: On BBS's!

Code: Alles auswählen

APOLOGIES:
        We are unable to fix a money roll over problem that manifests
    in a couple of different ways.  Your cash may mysteriously become
    negative, or you may end up with spikes in your stock graph, or 
    if your Railroad net worth rolls over, you may be kicked out of
    office.  This is an indication that you have done so well as to
    exceed the design parameters of the game.  
        
        We were also unable to raise the restrictions on the number
    of stations and trains available without completely rewriting the
    entire game.  There is room in the future for a bigger, better,
    faster game in the Railroad Tycoon line and we hope you will be
    with us at that time for more enjoyment.

                      
                      Thank-you for choosing RAILS DELUXE,

                            The RAILS DELUXE Team.
Figure 3b: Did these people learn nothing from Y2K? Oh ... [Aside: is it still safe to assume all readers are familar with the Y2K-Problem?]
_______________________________________________________________________________________________________________________

Being a mixture of old code from 1990 (original) and slightly newer code from 1993 (deluxe), and given the huge leaps of technology in that time, it is interesting to see that parts of the game are fighting for every bit -- using half bytes instead of bytes at places, having hardcoded limits for things at convenient powers of two -- while others parts are incredibly wastefull -- fullscreen renders with very little value to gameplay, redundant content, advertisements for other, long forgotton games.

There is another reason back from 1996 for which I call the game "RRT" and which is strangely relevant to my endevours years later. Consider this gem from the game manual:
Loading
This assumes your machine runs under DOS 5.0 when it boots, which is true of most IBM and compatible machines with hard disks.
(1) Turn on your machine. If it is already on, exit all programs and return to the root directory with the “CD” (change directory) DOS command. For example, if your hard disk is C: then “CD C:\” does this.
(2) Set Speed: If you have a “turbo” or multi-speed computer, use your normal speed setting.
(3) Enter MPS directory: You can enter the MPS directory with the “CD” (change directory) DOS command. For example, if you are in your root directory C:, then “CD MPS” will take you to the MPS directory.
(4) Load Program: Type “RDX” and press return. The simulation will begin loading.
For some readers this will bring back fond memories, others will probably be to young to have any idea what it is about. Both groups can be helped by watching retro computing videos on youtube, here is one about the Turbo Button. For me, when first starting the game, these instructions were at the same time much needed as they were confusing. What wasn't on the Windows 95 start menu didn't exist for me back then and Windows 95 machines did not boot into "DOS 5.0" whatever that was.

When following these instructions the game would greet you with a about 90 seconds of credits which, unlike in modern games, didn't even help to hide loading times. Figure 4 shows a screenshot from DosBox. It could be slightly shortned, but not interrupted completely, by hitting <ESC> at just the right time, but it got boring quickly.

______________________________________________________________________________________________________________________
intro_010.png
intro_010.png (8.33 KiB) 3366 mal betrachtet
Figure 4: Animated Credits
_______________________________________________________________________________________________________________________

Somehow I got my hands on an old manual for MS-DOS and learned that the RDX command, faithfully typed in per the instructions from the manual, was in fact not a command at all but rdx.bat and could be viewed in an editor. That revealed that two .exe were called, intro.exe and rrt.exe. Finding this out probably is the first thing I did that remotely resembled "hacking" in the traditional, technophile sense of the word.

From that day on, I always called rrt.exe directly and never had to watch the boring intro again. And therefore, for me the game is called RRT. Until ...

What Next?
intro.exe, disregarded for many years, loads the same media file formats as the full game but is much smaller (34 KB vs 263 KB) figuratively turning a 1000 pieces jigsaw-puzzle into one with 128 pieces. Anybody who has actually solved a jigsaw-puzzle knows that this is a huge difference. In the following I will finally describe finding the image decompression routine in intro.exe. But "Extracting: Intro.exe" would have made for a boring title.

Part III is online but it is not what I thought it would be.
Benutzeravatar
Schrompf
Moderator
Beiträge: 4518
Registriert: 25.02.2009, 23:44
Benutzertext: Lernt nur selten dazu
Echter Name: Thomas Ziegenhagen
Wohnort: Dresden
Kontaktdaten:

Re: Extracting: Railroad Tycoon

Beitrag von Schrompf »

Good story. Thanks for explaining!
Häuptling von Dreamworlds. Baut an was Neuem. Hilft nebenbei nur höchst selten an der Open Asset Import Library mit.
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

Schrompf hat geschrieben: 16.08.2019, 11:03 Good story. Thanks for explaining!
Thanks for the kind words.
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

tl:dr; Sid Meier's Railroad Tycoon Deluxe .pic files are LZSS-encoded. The rest of this post is pure meta.

[EDIT Note: I reworked this post, which is Part III, after writing, but before posting, Part IV]

QuickTime (tm?) won't Open my Files

To most laypersons "PDF" (the portable document format) ".pdf" (the filename extension commonly associated with it) and "Acrobat Reader" are probably vaguely the same thing. Does this make sense? You decide. Is it understandable, in terms of how Windows displays the underlying information? I think it certainly is. In that respect ".pdf" works completely different from let's say ".py" (python code). The ".py" filename extension tells us nothing about the actual file format (UTF-8? ANSI? UNIX-Lineendings? Tabs? Spaces?) and it is probably wise to have Windows associate the extension with something like notepad++ even though one does most of the actual editing in an IDE.

All of this is completely obvious once you stop and think about it, but it took me way to long to actually do stop and think. I grew up with the misconception that the filename extension is there to tell Windows with what application to open a file, whereas it is actually there to tell the user what, roughly, they might expect the content of the file to be. I think it is interesting how Windows doubles down on the misguiding interpretation by hiding the filename extension from the user, which, of course, is a completely terrible idea. So while I turn the filename extension display back on as the first thing whenever I use a new Windows installation, at the same time, unless I am very careful, I find my thoughts about these things being confused by the wrong intuition Windows has been feeding me for more than 20 years now.

The point being that I have no name for the file format which was used by RRT for its pictures. The files themselves do not tell. They do seem to start with with these four bytes 00 48 38 00 which might hint at a format indicator but that is impossible to google. I'm incredibly tempted, as a result of the confusion I just described, to call the format ".pic-file-format". I had to remind myself several times that this is nonsense, and fight my own confusion. The only thing that ".pic" really tells us, is that the file is probably a picture, which is correct.

Writing all of this down, a memory came back to me. Back in the day, several games (Myst, I think, but also others) used QuickTime which, when installed saw it fit to register itself as the application to at least one of the filename extensions which RRT used, I think it was said ".pic". I now remember the disappointment when QuickTime could not actually open the file. So in a way, I'm making good on that disappointment by decoding the files myself.

Non-Standard Experiences

Speaking of disappointment ... I opened this thread on a down note, very obvious in the initial post. I think we all know that feeling when you finish something and for some reason rather than euphoria you think more along the lines of "what? that was it?". A bit like leaving cinema after an not entirely satisfying movie. When wrtiting the next part I came to realise that more than being a project, extracting RRT was an experience for me. It made me see more clearly were I had no intuition or the wrong intuition for technical facts I thought I understood perfectly, intellectually. I do not know how relatable my confusion over something so simple as a filename extension is, and I should probably not admit to that confusion in public, but it has to be seen through that lens.

I planned this thread to mirror the structure of Krishty's thread, which is why I referenced the latter so much in my first post. I especially planned to point out, where my experiences differed from those he describes and my thoughts on why. To that end I should emphasise at this stage, how different the .pic situation is from the TIM situation in Krishty's warmup post. He writes:
Das Standardtexturformat der PSX ist TIM. Die meisten Tools des PSX-SDKs arbeiten mit diesem Format, und es ist direkt auf die PSX zugeschnitten. Nicht jedes Spiel nutzt dieses Format, aber die Wahrscheinlichkeit ist recht hoch.
From what I saw, this degree of standardisation was unheard of a few years earlier when RRT was developed. There is no standard .pic format, not only in the sense described above, that .pic just means "It's a picture" but also in the more narrow sense that even within one development ecosystem, in RRT's case Microprose, people could not agree on a file format for pictures.

The community has investigated into .pic files from "Sid Meier's Covert Action", "Sid Meier's Civilization", and Darklands. I have not made any in depth comparison recently -- something I would love to do but not as a priority right now -- but as far as I remember they all were different. And different from RRT's format. This is to say I am not aware of any other application that uses exactly RRT's file format. (Of course it would be embarassing now if I post the code and someone identifies that as a straightforward variant of DEFLATE or the like, but I'm pretty sure that is not the case).

I am still unsure if I even should count these other .pic adventures as previous work on my topic. Is it really previous work if it's about a different file format? I wouldn't regard other formats of .pic files -- like the one QuickTime apparently knows -- previous work just because of the filename extension. Are the formats somehow closer related because it says "Sid Meier" or "Microprose" on the box? I do not know.

What's next?

Since this is the second draft of this post, unlike in the previous ones I know what will be next. Before that, let me conserve some thoughts from the first draft of this:

When I began to write this series I knew it would take a long time to write it up and that I might loose interest before I finished. Several abandoned attempts at blogging taught me that much. I therefore felt obliged to provide at least one piece of information that brings us closer to the actual point of knowing how to decompress RRT's .pic files. So right in the tl:dr; I gave away that the files are LZSS compressed even though this post is not about LZSS. The next one will be ... sort of.

Word of warning though. If you are mostly waiting for the actual format or rather decompression algorithm I'm afraid they are pretty boring. I promise I will try to get there eventually but as you might have notices if you have read so far, that is not my main goal with this writeup. So until I get to the point, I hope the backstory of my experiences is mildly entertaining.
Benutzeravatar
Krishty
Establishment
Beiträge: 7858
Registriert: 26.02.2009, 11:18
Benutzertext: state is the enemy
Kontaktdaten:

Re: Extracting: Railroad Tycoon

Beitrag von Krishty »

LZSS, cool – like in Ace Combat 3, but probably not in separate streams.

It’s pretty impressive to see that games in the 90s had just a dozen different compression algorithms in use. That’s probably due to the fact that search engines weren’t quite mature and instead, game programmers all read the same books. (The same reason quaternions were largely unknown to game programmers until the early 2000s.)

Nowadays we have PNG (DEFLATE), JPEG (DCT + Huffman) and Webm derivatives (Wavelets) for textures on the hard drive, DXT (three different families) for textures on the GPU, Zip/gzip (DEFLATE again), BZip (BWT + Huffman) and 7-Zip (LZMA) for game archives, Mermaid/Selkie/Kraken (LZdontknow) for high-performance archives, H.265 (DCT + Motion detection + arithmetic coding) for videos *plus* the in-house stuff for things like terrain compression in flight simulators. Sucks to be a reverse engineer in the 2030s :)
They do seem to start with with these four bytes 00 48 38 00 which might hint at a format indicator but that is impossible to google. I'm incredibly tempted, as a result of the confusion I just described, to call the format ".pic-file-format".
48 38 is H8 in ASCII, so I’d just go by 0-H8-0 or the like.

On the other hand, TFX’s SPR files are entirely different from id’s SPRs, but I call them SPR anyway.
seziert Ace Combat, Driver, und S.T.A.L.K.E.R.   —   rendert Sterne
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

tl:dr; Wall of text - again, sorry - about how entropy makes me feel stupid.

Update: Three years later I think the main point I wanted to get across here, but likely failed to, is that two seemingly contradictory facts are simultaneously true. (a) Once you have found the right part of the disassembly, it is not all that hard to write a decoder in a higher level language and the algorithm while certainly somewhat ingenious is pretty straightforward (b) it is actually not that easy to understand what is going on here if you do not know where in the disassembly to look, for reasons that, at the time, I felt important to enumerate.

I have edited my original text slightly to make this point more clear.


LZdontknow

[Heading borrowed from here.]

In Part III I said that the RRT's .pic files were compressed with LZSS. This might be too bold of a statement. What I should be saying is that the rather abstract description on Wikipedia roughly fits what I am seeing in the disassembly. I might be wrong in my classification. I have checked other early LZ-family algorithms I am aware of and they are a less good fit.

It is important to not imagine that I found some LZSS reference implementation, ran it on my data and that was it. RRT certainly uses a variation of LZSS, maybe that variation has got its own name but if so I don't know it.

All the algorithms from the LZ-family have one thing in common.

Update: I think this goes to show that I was very unsure what audience I am writing for. I absolutely should have explained what that "one thing" is but apparently decided to assume the audience knows. Now I have forgotten the details and the best I can do is to suggest to look up LZ-family algorithms yourself. It is a cool algorithm, mind you, so it might be even worth it.

There are dozens of degrees of freedom when translating the central idea of LZ-family algorithms into a concrete implementation. It begins with the question how the compressed data could be stored in a stream of bytes - or even bits in many cases. For a simple example of what might go wrong think endianess.

Cryptoanalysis - The Easy Parts

In cryptoanalysis 101 they will teach you that you can easily - in fact automatically - crack a simple substitution cipher with frequency analysis and a dictionary. You know the drill. Count the frequency of the substituted letters, the most frequent one will also have been the most frequent one in the plaintext. For many languages that might be "e". Work your way from there. What they do not tell you, and what I did not fully appreciate as a problem before looking at RRT, is that this starts with a notion of substituted letters. Which you don't have. You have a bitstream.

What I am trying to do, so to speak, is cryptoanalysis, without the actually hard part. I have the compressed data, obviously, to me no better than ciphertext. I have the decompressed data, the would be plaintext, sort of: I can see the picture on the screen. I have the algorithm, albeit hidden in the blackbox of intro.exe. I am surprised how hard the "not hard" part is in reality. I begin to wonder how anybody ever does the "actually hard part" outside of the tidyness of cryptography-as-mathematics.

My task is to figure out what's in the blackbox not in the abstract sense of naming one algorithm or another, but in terms of implementation details. I have to get it right, down to the level of bits and bytes. By poking the blackbox a bit, I arrive at a pretty good idea how the decompressed data is stored in memory.

Update: I realise now, reading those lines again, with some amazement, how much this writing style was inspired by Shamus Young's writing. Shamus passed away last month. I will miss his writing.

Real cryptography keeps its secrets even if you know the algorithm and all implementation details, as long as you do not have the key. Thankfully this is not a problem I have. Once I have the source code, I can decode any number of .pic files. And yet, even now that I know all the details of how the files are decompressed, if you gave me only the compressed data, the uncompressed data and the info that it is LZSS, I could probably not derive the implementation details from that alone. Like I admitted in Part I the major breakthrough in my analysis was to find the right section of the disassemby, not some deep insight derived from first principles. I had to open the blackbox.

Update: I do no longer have strong feelings about this but apparently at the time it felt a bit like cheating to me that I solved the riddle by disassembly rather than analysis. I remember that it was a genuine surprise to me that this was neccessary and the sidetrack on cryptoanalysis is meant to explain that surprise: I was taught that this was the trivial part of a much larger issue.

Entropy isn't much Fun

I do not really remember since when I knew that the data was compressed. I think that was clear pretty fast. Previous work told me that I was probably facing some combination of an LZ-family algorithm and run-length encoding. Had there been no previous work, I could have learned from here that LZ and RLE was just how 90s programmers rolled. Cf. also Krishty's comment above. But most importantly, after ruling out actual encryption as not very likely -- although I think some later games did encrypt their data -- compression was the only explanation for the lack of patterns in the data or rather the nature of the patterns that were there.

We make sense of the world by recognising patterns. This influences how we canonically store data on a computer. There is no law-of-nature saying that text has to be stored as strings of characters and pictures have to be stored as spatial arrays of pixels but that is how humans make sense of text and pictures and so we naively represent and store things that way. We want our texts and pictures to be patterned, otherwise they would be just gibberish and noise to us. Compression, with apologies to Claude Shannon, is the act of replacing patterns by a shorter representation of the whole pattern, making the resulting string of representations shorter and less patterned than the original.

I learned that knowing this in the abstract and looking at relatively high entropy data for relatively long periods of time are two different things. I can somewhat recommend the latter as an exercise. It really puts some things into perspective.

_______________________________________________________________________________________________________________________

55 55 00 FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8 FC FF F8 FC DB 56 FF FE FF FF F8

Figure 5: The first few bytes of pixel data. Sort of. There is a pattern (FF F8 FC) but what does it mean?
_______________________________________________________________________________________________________________________

Entropy takes the neatness out of things. That neat trick where you render your data as a bitmap and see what is is? Won't work! That other neat trick to find regular spaced repetitions in the data? Guess what!

_______________________________________________________________________________________________________________________
2019-08-28 00_09_16-LABS.PIC - IrfanView (Zoom_ 377 x 472).png
Figure 6: See the pattern?
_______________________________________________________________________________________________________________________

What's next?

I'm glad to have this out of my system, so to speak. In ther following parts I will finally try to get technical.
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

I cannot promise that there will be a new part to this anytime soon, but in preparation to a possible continuation I felt the need to revisit and clean up the last part of this series. So I am taking the liberty to push the Thread.
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

When I started this thread three years ago I knew that if I did not write down my detailed findings, I would forget them. Then I chose to get all philosophical instead of writing down my findings and guess what happened.

Disclaimer

I learned the semi-hard way to never publicly share code. You quickly throw something together in your free time, maybe even in a language you very rarely use, and instead of taking it for what is is, somebody will surely think it is code review time and lecture you how you are totally uneemployable if you write code like that. Reluctantly I will share code here but please (a) don't lecture me on style [lectures on actual content are welcome] and (b) if you are an employer googling my name please do not base your hiring decision on this code.

.pic File Header
_______________________________________________________________________________________________________________________

00 48 38 00 80 02 90 01 00 00 0A 0F 0F 05 00 0A 00 00 0A 0A 0A 00 00 08 08 08 0A 05 00 0A 0A 0A 05 05 05 05 05 0F 05 0F 05 00 0C 0F 0F 05 05 0F 0E 0B 00 00 00 0F 0F 0F

Figure 7a: File Header Data Example

Code: Alles auswählen

typedef struct
{
    uint8_t unk_[4];
    uint16_t width;
    uint16_t height;
    uint8_t palette[palette_size * 3];
} pic_header;
Figure 7b: File Header Structure
_______________________________________________________________________________________________________________________

Some things you can just guess. That the file has a header at all (as opposed to raw data) is a pretty safe bet if you open several files of the same type and they all start with the same few bytes. As briefly discussed above, I still do not know what 00 48 38 00 signifies but it probably is some sort of format/version identifier. After that, 280 hex is 640 dec and 190 hex is 400 dec. So clearly this has to be the image resolution. Well, I don't know about you but I do not read binary data every day. I have JSON and XML for the better or worse. My brain is not used to parsing 80 02 as 280 (damn you, endianness) and the constant 280 (hex) is nowhere nearly as familiar to me as 640 (dec). On top of that, I think I might belong to the last generation that recognises 640 (not a power of 2, mind you) as "clearly" being a display resolution. It is a long way from modern displays.

If you know what to look for, the rest of the header (00 00 0A 0F 0F 05 00 0A 00 etc.) looks suspiciously like palette data. Or, I guess, like RGB pixel data, but the pattern breaks after 16 colors. I do no longer remember how I actually found out that this was a palette, I am pretty sure I manipulated the data, looked at the resulting image on screen and went "of course" at some point. In any case, not much magic here.

Layers

We have a palette with 16 colors which gives us 4 bit per pixel. These are not stored in order but as 4 consecutive 1-bit images which I call layers. This seems unusual to me, I would expect the data for one pixel to be grouped together. Then again I am not an expert for graphics formats. If somebody can guess why this was done I would be interested. [Aside: there is another (uncompressed) file format present in the game files that stores two 4-bit pixels together in a byte, so much for consistency].

Again, as far as I remember I found out about the layers by manipulating the data and noticing a change of color rather than a change of pattern. In any case I new that the data was "layered" before knowing how to decompress it. Finding this out required a lot of experimentation because the data being compressed means that small changes in the data can have large and seemingly chaotic effects on the resulting image but eventually it became clear what was going on.

Compression

The first 16 bits of each layer contain the number of bytes in the (compressed) layer. We read the whole layer into in_buffer which we index with in and write the uncompressed data into out_buffer which we index with out. We use a helper function next_bit that will return the next bit from a cache of 16 bits and once the cache is empty read another 16 bits from the in_buffer. I have not investigated this but at least in principle this caching implies that other data might be read from the in_buffer before the full 16 bits in cache have been processed, meaning that the bits as a whole are not processed in the order in which they are stored, adding more complexity to understanding the data.

With that said, the code below is a direct translation of the disassembly, which shows in the variable names, e. g. cf means carry-flag.
_______________________________________________________________________________________________________________________

Code: Alles auswählen

while (1)
{
	uint8_t cf = next_bit(&bitreserve, in_buffer, &in);

	if (cf != 0) //0039
	{
		out_buffer[out] = next_byte(in_buffer, &in); //003B
		++out;
		continue;
	}
	else
	{
		int16_t offset = 0x0;
		uint8_t cx = 0x0;

		cf = next_bit(&bitreserve, in_buffer, &in);

		if (cf == 0) //004A
		{
			cf = next_bit(&bitreserve, in_buffer, &in);

			cx = (cx << 1) | cf; //0056

			cf = next_bit(&bitreserve, in_buffer, &in);

			cx = (cx << 1) | cf; //0062

			cx += 2;

			offset = (0xFF00) | next_byte(in_buffer, &in);
		}
		else
		{
			offset = next_word(in_buffer, &in);

			_register bt;
			bt.word = offset;

			uint8_t ah = bt.bytes[1] & 0x07;

			bt.bytes[1] = bt.bytes[1] >> 3;
			bt.bytes[1] |= 0xe0;

			offset = bt.word;

			if (ah)
			{
				cx = ah + 2; //007C
			}
			else
			{
				uint8_t al = next_byte(in_buffer, &in);

				if (al == 0x0)
				{
					break; //00BC
				}

				if (al != 0x1) //008D
				{
					cx = al + 1; //0091
				}
				else
				{
					//On a 16 bit system there is the need to rebase everything
					//if 16 bit adressing is not sufficient.
					//Here we just use a wider data type for in and out
					//However this is untested

					printf("%s", "Untested skip of rebasing");
					continue;

					/*
					bx = out; //0096
					out &= 0xF;
					out &= 0x2000;
					bx = bx >> 4;

					uint8_t * eax = out_buffer;
					eax += bx;
					eax -= 0x200;
					out_buffer = eax;

					bx = in;

					in &= 0xF;

					bx = bx >> 4;

					eax = in_buffer;
					eax += bx;
					in_buffer = eax;

					continue;*/
				}
			}
		}
		for (; cx != 0; --cx)
		{
			out_buffer[out] = out_buffer[offset + out];
			++out;
		}
	}

}
Figure 8: How to decompress a layer
_______________________________________________________________________________________________________________________

So what is going on here? Mostly the bitstream in bitreserve acts as a control stream as explained on Wikipedia. The central idea is to either write a byte directly from input to output (no compression) or to reuse a sequence of bytes already present in the output denoted by starting position and length (compression). The first if/else makes this distinction. In the compressed case, the number of bytes to be reused will end up in cx (originally a register hence the name) and the starting position of the reused sequence will end up in offset as a negative number. The for-loop at the end does the copying. In hindsight I could not say why this is not implemented as memcpy probably because a) I do not use C all that often and b) this is a translation from assembly.

The pair of cx, offset can be stored in different ways. If offset fits into one byte then cx is actually stored in the control stream. If offset is larger than 8 bit, cx, offset are stored together in one word (16 bit) where cx is in the lower three bits of the high byte (?!). If cx does not fit into these three bits, it is stored as a separate byte. In other words this really cares about saving bits.

I think that is all you need to understand the idea behind the code. I will not claim that I myself understand / remember all the details so that is all for today.
Benutzeravatar
Krishty
Establishment
Beiträge: 7858
Registriert: 26.02.2009, 11:18
Benutzertext: state is the enemy
Kontaktdaten:

Re: Extracting: Railroad Tycoon

Beitrag von Krishty »

Awesome 👍 I had the same experience sharing code, but nowadays I just don’t give a f and post it anyways 😁 Thanks for sharing!
seziert Ace Combat, Driver, und S.T.A.L.K.E.R.   —   rendert Sterne
rhaul
Beiträge: 2
Registriert: 22.09.2022, 10:04

Re: Extracting: Railroad Tycoon

Beitrag von rhaul »

Zunächst möchte ich mich für mein Deutsch entschuldigen, da ich Spanier bin und die Sprache nicht vollständig beherrsche und daher den Google-Übersetzer verwende.

Obwohl ich kein Programmierer bin und mir dieser Welt überhaupt nicht bewusst bin, habe ich alle Informationen über diese Untersuchung gelesen, so gut ich konnte.

Ich übersetze gerade das Railroad Tycoon Deluxe-Spiel ins Spanische und es war eine angenehme Überraschung für mich, diesen Thread zu finden, aber ich würde gerne wissen, ob Sie planen, ein Bildbearbeitungstool für dieses Spiel zu erstellen, da ich daran interessiert bin, das berühmte zu bearbeiten "Belohnungs"-Poster "der Banditen, die auf den Karten von Nordamerika erscheinen (zumindest denke ich, dass es Bilder sind, da ich die Texte mit dem Hex-Editor nicht finden kann)

Alles Gute.
Alexander Kornrumpf
Moderator
Beiträge: 1946
Registriert: 25.02.2009, 13:37

Re: Extracting: Railroad Tycoon

Beitrag von Alexander Kornrumpf »

Hi rhaul,

unfortunately I do not speak Spanish, both German and English are fine. Google translate might work slightly better with English.

I think you mean this:
WANTED.PNG
WANTED.PNG (6.4 KiB) 280 mal betrachtet
It is indeed stored as an image, not text.

I am not sure if I will ever find the time to build an encoder, cannot promise anything.
rhaul
Beiträge: 2
Registriert: 22.09.2022, 10:04

Re: Extracting: Railroad Tycoon

Beitrag von rhaul »

Alexander Kornrumpf hat geschrieben: 22.09.2022, 11:00 Hi rhaul,

unfortunately I do not speak Spanish, both German and English are fine. Google translate might work slightly better with English.

I think you mean this:

WANTED.PNG

It is indeed stored as an image, not text.

I am not sure if I will ever find the time to build an encoder, cannot promise anything.
Actually that picture is what I was referring to.

Anything that helps to translate this great game is welcome and it will surely help other translators into other languages, that includes German.

All the games you mention in the thread are wonderful, in fact, I was even thinking of translating Darklands but I had to discard it precisely because it didn't have translation tools, Civilization if I was able to translate it and I was even able to edit the images thanks to another programmer who is a fan of the civilization saga, the same could help you.
Zuletzt geändert von rhaul am 26.09.2022, 11:33, insgesamt 1-mal geändert.
Antworten