HFS on SSD, logical page map with duplicates

Started by peterpion, September 14, 2019, 17:08:44 PM


peterpion

Hi Elmar,

I have an interesting(ish) problem I have been working on for a while. I had an SSD with a hackintosh installed on it (i.e. OS X on PC hardware) and the SSD stopped working, probably because of a power-off at a bad time. The drive still responds in engineering mode (default chipset firmware), and through this and the OpenSSD project, which luckily targets my chipset (Barefoot), I have been able to create a physical image of the drive. In this image I found many 16 KB pages of recognisable data, such as cached web pages.

Analysis revealed that a few percent of the drive is populated with what look like logical page numbers in a very clear pattern, so I used them to create a translation table (logical to physical address). I then used the stackbd module, which I modified to let me mount the image as a loop device and map OS block requests to the pages pointed to by the map file. Doing it this way lets me quickly try different mapping options to see what works best.

I then tried to mount what I hoped would be a reassembled logical image, but with no joy (I used hfsprescue to determine the partition start block and losetup to map a range of blocks). So I ran hfsprescue on it and, to my surprise, quite a few files were recovered. Many of these files were photos around 10 MB in size, and since the physical page size is 16 KB, recovering them intact must mean that many pages are being placed contiguously and correctly, so my mapping is at least partially right (your opinion on this would be appreciated).

The big problem (I believe) is that around 5% of the page LPNs are duplicated. This must happen where a page has been modified and rewritten in a different place on the disk, but the original page containing the out-of-date data (and its LPN) has not yet been erased. So I assume this makes 5% of my drive corrupt. Perhaps most of these pages are ones used for directory structures; that would make sense, since directories are likely to have changed much more than files in the last few weeks of the disk's operation (I was moving things around a lot, reorganising ready for transfer to a new backup machine :-( )

Although I have experience working with the firmware and C to extract the drive, understanding HFS is a different matter and I am having difficulty with it. Since you have a deep understanding of it, perhaps you could suggest where to go from here. I suspect the image I have is correct apart from 'holes' in it: pages of incorrect data. The volume header is found at around a 14 MB offset (is that normal-ish?). For instance, the page that a directory uses might have been replaced with a chunk of an image file (the SSD moves data around internally for wear levelling etc.). Ideally I would like to examine the initial parts of the disk, i.e. find the root directory page(s) and follow the links (references to other pages) in it to subdirectories, and then try the various candidate pages to see which seems right. In other words, for 5% of the 16 KB logical pages I might have 2 to 100 copies, only one of which is the correct one, and I don't know which to use.

BTW, the terminology confuses me a bit: in the SSD firmware project, blocks are sections of 2 MB or so and pages of 16 KB are the smallest unit it writes, whereas in OS programming blocks seem to be 512 bytes, etc. I hope you understand what I mean above.

Perhaps my mapping is totally wrong and your program is simply reassembling randomly placed 16 KB pages from the disk - is it capable of that? I am also working on discovering where the drive stores the expired-page information, but so far I have not found it (that was of course my first choice, but it just does not seem to be there; perhaps the drive uses a scheme too complicated for me to figure out by looking at bytes with no context).

I am wondering if I can manually figure out, for a few blocks, which copies are the correct ones; it might give me a clue as to how to tell valid blocks from expired ones.

My tools, BTW, are mainly C and hex editors; the easiest way of working has been to write small C programs to do the searching etc.

Anyway, that's it for now. Your thoughts would be appreciated if you have the time!

Kind regards, Pete

Elmar

Hello Pete,

until now, I have not gone as deep as you have into the SSD low-level topic. I have never looked at SSD firmware source code, so I don't know if I can help you. It is also difficult over the web, without a copy of the drive.

Quote from: peterpion on September 14, 2019, 17:08:44 PM
The volume header is found at around a 14 MB offset (is that normal-ish?)

There can be various reasons for this location.
Is there a partition before the data partition? Maybe the EFI boot partition?
Is this the offset after your low-level SSD extraction?


Quote from: peterpion on September 14, 2019, 17:08:44 PM
BTW, the terminology confuses me a bit: in the SSD firmware project, blocks are sections of 2 MB or so and pages of 16 KB are the smallest unit it writes, whereas in OS programming blocks seem to be 512 bytes, etc. I hope you understand what I mean above.

When you write a file to the drive, there are a few levels with (but not always) different block sizes.

You have the file system block size, which nowadays is (depending on the FS type and disk size) 4096 bytes (4K).
You have the standard block size of the drive, 512 bytes for hard disks.
Inside the disk, you may have a block size of 512 bytes, 4K or something else, which is not important for the upper levels, except as a matter of speed. When a disk has a 4K internal block size, it is better to align the partition/file system to 4K (this is standard nowadays and done by the partitioning software). Usually, however, the internal block size is only important for the firmware.
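
To make the relationship concrete, here is a small sketch; the 4K file system block, the 512-byte sector and the partition start are example figures only, not values from your drive:

/* Sketch: converting a file system block number to 512-byte device sectors
 * and checking 4K alignment.  All figures here are examples. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t sector_size     = 512;      /* what the OS block layer uses             */
    const uint64_t fs_block_size   = 4096;     /* typical file system allocation block     */
    const uint64_t partition_start = 411648;   /* hypothetical partition start, in sectors */

    uint64_t fs_block     = 1000;              /* some file system block number            */
    uint64_t spb          = fs_block_size / sector_size;          /* sectors per FS block  */
    uint64_t first_sector = partition_start + fs_block * spb;

    printf("FS block %llu -> sectors %llu..%llu\n",
           (unsigned long long)fs_block,
           (unsigned long long)first_sector,
           (unsigned long long)(first_sector + spb - 1));

    /* the 4K alignment mentioned above: partition start in bytes should be a multiple of 4096 */
    printf("partition start is %s4K-aligned\n",
           (partition_start * sector_size) % 4096 == 0 ? "" : "NOT ");
    return 0;
}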


What you already figured out is that it is necessary to create a disk image where the blocks of the SSD are at the correct locations. I don't know where the SSD firmware reads/writes the information for the block translation. Maybe you have already found such information? We can also exclude the wrong duplicate blocks once we know how the translation works.


Best regards
Elmar

peterpion

Hi Elmar,

Thanks for the reply. It's been two years since I last worked on this - I put a few weeks into it and then, due to family illness, had to drop it suddenly without making any notes of where I was. It's a complex problem and I have forgotten much of it. Picking it up again recently, I've been reading the source code I wrote and playing with the whole thing to re-familiarise myself with it.

Re the EFI point you made, I think you are right. In the source for the stackbd driver (the kernel module I modified to remap sector requests from where the operating system thinks a page resides to where I actually think it resides), I notice that I skip the first 200 MB or so. The volume header turns up 14 MB after that, so the total space before the HFS header is around 214 MB. Reading up on the EFI partition, it seems it is usually around 210 MB, which sounds about right. I probably originally found the volume header by searching for the HFSJ string and decided to skip some space before it to speed up scans.

Re the SSD aspect and how it complicates matters: really the only unique thing about SSDs is that they don't store logical pages (or blocks, whatever the right term is) sequentially. They mix them up and move them around depending on how often they are used. Data that is very infrequently modified sometimes gets moved to physical flash that has been written a lot, to make sure the physical memory wears evenly (each flash block only survives on the order of 10,000 writes before it starts to wear out).

Also, you can't partially delete a block or overwrite it in place, so you have to copy it to a new location, changing the part that needs to change.

On this drive the smallest unit you can write is 16 KB (a page), and the smallest unit you can erase is around 2 MB (an erase block), so you have to erase 2 MB chunks at a time.

Because of all this, the SSD moves blocks around constantly. It stores the logical address of each block and, when it starts up, builds a map in RAM of which physical block to use when asked for a particular logical block. It updates this map each time it relocates a block in physical memory.

Once I got my head around this, it is fairly simple. Of course, finding the logical block number of each physical block is the problem, but that jumped out at me as I searched: the last 16 KB page in each 2 MB chunk contains numbers ranging from 0 to around 15 million, never higher. That is exactly the range logical block numbers should fall in. When I used these numbers to identify the corresponding blocks, the disk became much more organised and I was able to find very large chunks of contiguous data, which I am fairly certain means this is the correct way of mapping physical to logical blocks.

The only problem is, as I say, that blocks which are moved from one location to another (when they are updated by the OS) are not erased immediately, so I end up with duplicates of many blocks (around 700k duplicates out of 15 million blocks).

One idea I had was to modify hfsprescue to take a list of candidate physical addresses for each logical block and see if it can tell which is the correct one. I don't know whether it can detect that a block 'looks right' or not, but if it can, this might work.

Another idea is based on the mapping table I have made. I think the remapped image is probably a reasonably well organised copy of the drive, but I don't know, and I would like to examine its structure. Do you know if it is possible to 'explore' the structure of an HFS volume manually? Is it built like a chain of directory files which reference deeper directory files, or does it use a central catalogue file (I have seen references to some kind of main catalogue)?

If so, could this be examined manually, or is it too complex? Could I write a C program to do it for me? Is there any part of hfsprescue, or any other tool you know of, that I could modify to help with this?

I am imagining finding some kind of root file that lists block numbers where files live, and that also references other similar blocks, etc. Some of these blocks may now have been substituted with incorrect data (chunks of image files, for instance). But in the list I generated of possible physical block addresses for each logical block address, the correct one probably exists, and I can probably work out which one it is. After doing this for a few blocks, I might be able to figure out what marks a block as expired and hence fix the rest of the disk automatically.
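
(For reference: HFS+ does keep this in one place. Per Apple's TN1150, the volume header sits 1024 bytes into the volume and points at a catalog B-tree file that holds the entire folder/file hierarchy, so 'walking from the root' starts with reading that header. Below is a minimal sketch of that first step against the remapped image; the struct layout follows TN1150, and the image path and partition-offset arguments are hypothetical.)

/* Minimal sketch, assuming the HFS+ layout from Apple TN1150: read the volume
 * header 1024 bytes into the volume and report where the catalog B-tree starts.
 * All on-disk fields are big-endian. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <arpa/inet.h>                      /* ntohs/ntohl for big-endian fields */

#pragma pack(push, 1)
typedef struct { uint32_t startBlock, blockCount; } HFSPlusExtentDescriptor;
typedef struct {
    uint64_t logicalSize;
    uint32_t clumpSize, totalBlocks;
    HFSPlusExtentDescriptor extents[8];
} HFSPlusForkData;
typedef struct {
    uint16_t signature, version;
    uint32_t attributes, lastMountedVersion, journalInfoBlock;
    uint32_t createDate, modifyDate, backupDate, checkedDate;
    uint32_t fileCount, folderCount;
    uint32_t blockSize, totalBlocks, freeBlocks;
    uint32_t nextAllocation, rsrcClumpSize, dataClumpSize, nextCatalogID, writeCount;
    uint64_t encodingsBitmap;
    uint32_t finderInfo[8];
    HFSPlusForkData allocationFile, extentsFile, catalogFile, attributesFile, startupFile;
} HFSPlusVolumeHeader;                      /* 512 bytes, per TN1150 */
#pragma pack(pop)

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s image partition_start_bytes\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    HFSPlusVolumeHeader vh;
    fseeko(f, atoll(argv[2]) + 1024, SEEK_SET);          /* volume header: 1024 bytes into the volume */
    if (fread(&vh, sizeof vh, 1, f) != 1) { fprintf(stderr, "short read\n"); return 1; }

    if (ntohs(vh.signature) != 0x482B)                   /* 'H+' */
        fprintf(stderr, "warning: no H+ signature at this offset\n");

    uint32_t absz = ntohl(vh.blockSize);                 /* allocation block size, typically 4096 */
    uint32_t cat  = ntohl(vh.catalogFile.extents[0].startBlock);
    printf("allocation block size  : %u bytes\n", absz);
    printf("catalog B-tree extent 0: allocation block %u = byte %llu from the volume start\n",
           cat, (unsigned long long)cat * absz);
    fclose(f);
    return 0;
}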

At the moment, though, the whole structure of the disk is a mystery to me. hfsprescue clearly makes some sense of it, because it can find two million or so files and recover quite a bit of data, so there must be some degree of structure left. I guess I need to start reading the hfsprescue source code, but I wonder if you can tell me whether this approach might work and point me in the right direction!

Kind regards, Pete


Elmar

Thanks for the description. I already knew most of that, but I didn't know how the translation is stored. I checked some documents and it's now clearer.

Are you able to export the NAND flash from the SSD drive?

Duplicate blocks are not the primary problem. They can happen on a healthy FS too and will be handled later by hfsprescue. The problem is that we need to know which page is a valid page and what the LBA of that page is. We get this information from the block address table in the NAND storage.

Best regards
Elmar

peterpion

Hi Elmar,

Yes, I believe I have a full dump of the NAND data. The chip storage actually totals 274877906944 bytes (exactly 256 * 2^30), going by the NAND chip IDs and the datasheet. But the drive reports, IIRC, 256 * 10^9 bytes or so. I don't want to use the suffix GB (or Gb), because drive manufacturers use GB to mean 10^9 bytes, rather than the 1024*1024*1024 bytes that the rest of the computing world understands by it. It has been rather a confusing point.

But the chips do total the amount I wrote above, and I can read all of it into a file of 274877906944 bytes, which I now work from.
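
For completeness, the capacity arithmetic looks like this (the reported figure is from memory, so the spare percentage is only indicative):

/* Quick sanity arithmetic on the capacities quoted above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t raw      = 274877906944ULL;            /* total NAND = 256 * 2^30 bytes        */
    uint64_t reported = 256000000000ULL;            /* what the drive advertises (approx.)  */
    uint64_t spare    = raw - reported;

    printf("raw NAND       : %llu bytes (= 256 GiB)\n", (unsigned long long)raw);
    printf("reported       : %llu bytes\n", (unsigned long long)reported);
    printf("over-provision : %llu bytes (about %.1f%% of raw)\n",
           (unsigned long long)spare, 100.0 * spare / raw);
    return 0;
}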

The chips also have a spare area of 128 bytes per 4096 bytes (so 512 bytes per 16 KB page), but I can't seem to access it. It's possible that important info is stored there, but I don't think so; you'll see why in a minute.

If you're interested, the chip datasheet is here: https://docs-emea.rs-online.com/webdocs/0def/0900766b80defffe.pdf

The data is spread across chips as you know, 32 of them in my case, and each chip is organised into 'planes' and 'dies'. I had to give the firmware some parameters before reading the chips, to tell it how they are arranged. I did some experimenting and settled on a configuration which seems to work. I believe this because, in the 274 GB dump I mentioned above, I can find many, many 16 KB chunks which are clearly not scrambled. For instance, there are 16 KB segments of HTML which I can read through and see are fully intact chunks of pages that must have been in the browser cache on the disk.

So that makes me think I have configured the chip layout correctly, although it's possible I am wrong.

Now for the interesting bit. The data is, as I said, organised in 16 KB chunks - blocks, pages, I am not sure what to call them (the chip datasheet tends to call them pages). There are 127 pages of what I am sure is data from my drive, i.e. my user data; this is where I find things like web pages, image header data, etc. Then there is a final page (16 KB in size) which contains what I am sure is LBA data, plus a bit more. The pattern then repeats, about 131000 times (giving the roughly 274 GB file). Each of these 128 * 16 KB chunks is the size of an erase block (2 MB or so).
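
Just as a cross-check, the geometry above is self-consistent; this is nothing more than the arithmetic on the figures already given:

/* Cross-check of the layout described above: 128 pages of 16384 bytes per
 * erase block, and how many such blocks fit in the 274877906944-byte dump. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t page_size    = 16384;
    const uint64_t pages_per_eb = 128;                      /* 127 data pages + 1 LBA page */
    const uint64_t eb_size      = page_size * pages_per_eb; /* 2097152 bytes = 2 MiB       */
    const uint64_t dump_size    = 274877906944ULL;

    printf("erase block size    : %llu bytes\n", (unsigned long long)eb_size);
    printf("erase blocks in dump: %llu\n", (unsigned long long)(dump_size / eb_size)); /* 131072 */
    return 0;
}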

To describe this last page a bit more: it starts with an identifier which seems to indicate the type. Not all 2 MB erase-size sections are data; some have different IDs here. I don't know what they all mean, but they must be drive housekeeping data. I ignore them, for now at least.

Next in this last page come 128 bits which seem to indicate whether the following LBAs are valid. A bit is set to 0 when there is a duplicate of the corresponding LBA in the same 2 MB section.

Then there are 128 numbers, all between 0 and 16777216 - the right range for a 256 GB disk with a 16 KB block size (256 GiB / 16 KiB = 16777216). I assume they are the LBAs. I have written them all out to a 'map file', which should be the translation table between physical and logical addresses that you mentioned.

So, as I said, I have made a kernel driver which can read this map file and translate block requests on the fly. I attach the dump of the disk, and the driver uses the map file to redirect each request for a particular logical block to the physical address the map points to. This 'translated' block device appears as a virtual device in /dev.

So, if I read from the virtual device and ask for, say, block 0, the modified stackbd module looks up which physical block on the original disk had an LBA of 0 (perhaps it's block 12345, or 54321, etc.) and rewrites the request so the kernel returns that block instead of block 0.

By doing this I am translating the arrangement of blocks in the original dump, using the addresses that I believe are LBAs, into a device ordered by LBA.
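
In userspace the same translation amounts to something like the sketch below; this is not the stackbd code itself, and the binary map-file format (one uint64 physical page index per LBA, UINT64_MAX for unmapped) is an assumption for illustration:

/* Userspace sketch of the translation: look up a logical 16 KB page in the map
 * and pread() the corresponding physical page from the raw dump. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PAGE_SIZE 16384
#define LBA_COUNT 16777216ULL                             /* 16 KB logical pages on 256 GiB  */

/* Fill buf with logical page `lba`, reading the physical page that map[lba]
 * points to in the raw dump; return -1 if the LBA is unmapped or unreadable. */
static int read_logical_page(int fd, const uint64_t *map, uint64_t lba, uint8_t *buf)
{
    if (lba >= LBA_COUNT || map[lba] == UINT64_MAX)
        return -1;                                        /* hole: no physical page known    */
    off_t off = (off_t)map[lba] * PAGE_SIZE;
    return pread(fd, buf, PAGE_SIZE, off) == PAGE_SIZE ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 4) { fprintf(stderr, "usage: %s dump mapfile lba\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);                     /* the raw 274 GB dump             */
    FILE *mf = fopen(argv[2], "rb");                      /* binary map: one uint64 per LBA (assumed) */
    if (fd < 0 || !mf) { perror("open"); return 1; }

    uint64_t *map = malloc(LBA_COUNT * sizeof *map);      /* about 128 MB                    */
    if (!map || fread(map, sizeof *map, LBA_COUNT, mf) != LBA_COUNT) {
        fprintf(stderr, "could not load map file\n");
        return 1;
    }

    uint8_t buf[PAGE_SIZE];
    uint64_t lba = strtoull(argv[3], NULL, 0);
    if (read_logical_page(fd, map, lba, buf) == 0)
        fwrite(buf, 1, PAGE_SIZE, stdout);                /* emit the reassembled logical page */
    else
        fprintf(stderr, "LBA %llu is unmapped or unreadable\n", (unsigned long long)lba);

    free(map); fclose(mf); close(fd);
    return 0;
}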

This all sounds good, but there are problems. The main one, which I mentioned earlier, is that many LBAs are referenced several times on the disk. That is, as I scan the disk and put the LBAs from those last pages into an array, I sometimes get duplicate entries. There are about 16 million LBAs and I find around 700k duplicates, roughly 5%. Somewhere the drive stores the information I need to resolve this, but I can't find it yet, although, as I say, I wonder if I can get clues by examining the FS itself.

BTW, a couple of years ago I asked for help on hddguru and described some of this in more detail - link here: https://forum.hddguru.com/viewtopic.php?f=10&t=35428&mobile=mobile

Elmar


peterpion

OK, sorry for the delay in getting this to you! First I will upload the file, and then a description in a subsequent post.

peterpion

OK, so this file comes from what I call 'superblock' 37. A 'superblock' is 16384*32*128 bytes (64 MiB). Each superblock is broken into 32 subunits of around 2 MB each, and each of those has its 16384-byte LBA identification block at the end, preceded by 127 data blocks of 16384 bytes each.

So, if we print the first 100 bytes of the last 16384-byte block (the LBA identification block), you will see the first 2 bytes are 25 00. The byte order is reversed (little-endian), so this is 0x0025, i.e. 37 (for superblock 37). I check this to identify data blocks; blocks without it contain junk.

xxd -s $((16384*127)) -l 100 elmarsfile
01fc000: 2500 7f00 7f00 0000 ffff ffff ffff ffff  %...............
01fc010: ffff ffff ffff ff7f 4d06 da00 5d06 da00  ........M...]...
01fc020: d25d 9c00 065e 9c00 0a5e 9c00 205e 9c00  .]...^...^.. ^..
01fc030: 305e 9c00 4a5e 9c00 5a5e 9c00 635e 9c00  0^..J^..Z^..c^..
01fc040: 785e 9c00 8b5e 9c00 915e 9c00 ab5e 9c00  x^...^...^...^..
01fc050: ae5e 9c00 d85e 9c00 9e60 9c00 c360 9c00  .^...^...`...`..

After this we have 7f00 7f00 - I have not found a meaning for this yet.

Then we see ffff ffff ffff ffff ffff ffff ffff ff7f, which makes more sense if we convert it to binary and then reverse the bits within each 8-bit byte:

1111 1111  1111 1111  1111 1111  1111 1111  1111 1111  1111 1111  1111 1111  1111 1110 

It turns out that if an LBA appears twice in the subsequent list (which I describe next), then the first occurrence of it is marked here with a 0. So if LBA 1234 appears at positions 5 and 10 in the list, bit 5 will be 0.

After these 16 bytes, the list of LBAs starts; it is 512 bytes long, consisting of 128 32-bit words. Each word is again byte-reversed (little-endian), so the first one you see in the print-out above, '4d06 da00', is 0x00DA064D, or 14288461. This is the logical block number of the first 16384-byte data block in the bin file I posted, covering byte offsets 0 to 16383.

The next 4-byte word in the print-out is '5d06 da00' = 0x00DA065D = 14288477. This is the logical block number of the next 16384-byte data block in the posted bin file, from byte 16384 to byte 32767.

And so it goes on for the 127 data blocks of 16384 bytes, with block 128 being the identification block itself.

Then this superblock structure repeats a total of 4096 times across the entire disk.
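
Putting the layout above into code, a rough sketch of how a map file could be generated from the raw dump looks like this. The field offsets are inferred from the xxd output (ID at byte 0, the unknown 7f00 7f00 at byte 2, the 128-bit duplicate bitmap at byte 8, the 128 little-endian LBAs from byte 24), and the superblock-ID check and bit ordering follow the description above, so treat all of them as assumptions:

/* Sketch: scan the 274877906944-byte dump, read the identification page at the
 * end of each 2 MB erase block and emit "LBA physical-page" pairs on stdout. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE    16384
#define PAGES_PER_EB 128                                    /* 127 data pages + 1 ID page  */
#define EB_SIZE      ((uint64_t)PAGE_SIZE * PAGES_PER_EB)   /* 2 MiB erase block           */
#define EB_PER_SB    32                                     /* erase blocks per superblock */

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(int argc, char **argv)
{
    FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;       /* the raw dump */
    if (!f) { fprintf(stderr, "usage: %s dumpfile > mapfile.txt\n", argv[0]); return 1; }

    uint8_t id_page[PAGE_SIZE];
    for (uint64_t eb = 0; ; eb++) {
        if (fseeko(f, eb * EB_SIZE + (uint64_t)(PAGES_PER_EB - 1) * PAGE_SIZE, SEEK_SET) != 0 ||
            fread(id_page, 1, PAGE_SIZE, f) != PAGE_SIZE)
            break;                                           /* end of dump */

        uint16_t id = id_page[0] | id_page[1] << 8;          /* e.g. 0x0025 for superblock 37 */
        if (id != eb / EB_PER_SB)                            /* non-matching ID = housekeeping/junk;
                                                                assumes superblocks are stored in order */
            continue;

        const uint8_t *bitmap = id_page + 8;                 /* 0 = first (stale) copy of a duplicate */
        const uint8_t *lbas   = id_page + 24;                /* 128 little-endian 32-bit LBAs         */

        for (int i = 0; i < PAGES_PER_EB - 1; i++) {         /* the 127 data pages; word 128 unclear  */
            if (!((bitmap[i / 8] >> (i % 8)) & 1))           /* bit order per the reversed-byte reading */
                continue;
            uint32_t lba  = le32(lbas + 4 * i);
            uint64_t phys = eb * PAGES_PER_EB + i;           /* physical 16 KB page index in the dump */
            printf("%u %llu\n", lba, (unsigned long long)phys);
        }
    }
    fclose(f);
    return 0;
}

This emits one 'LBA physical-page' pair per data page; the roughly 700k cases where the same LBA turns up in more than one erase block still need the separate de-duplication step discussed below.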



peterpion

BTW, today I did a little sanity checking to see whether my LBA translation seems reasonable. I searched the translated disk (the virtual device that is the remapped raw disk, translated by the logical mapping I just described). I looked for the string HTML, chose a good-looking hit and noted its byte offset. Dumping the next 16384*16 bytes showed what seems to be a contiguous 256 KB block of text, which I attach. It reads fine, and at the transitions from block to block (at byte offsets 16384*n) there is no hiccup - i.e. so far as I can read, it is a faithful reproduction of a large original file. The physical block numbers it came from are, however, jumbled up considerably:

Physical block numbers:
5213129
12801290
9342987
3414209
384582
1935015
2493452
4813293
12883118
11505071
4423568
3267107
6689989
384646
11505103
5988372

So, in other words, the mapping scheme I described above is able to take these physical block numbers and organise them properly. With 16 contiguous blocks reassembled correctly from a completely jumbled original order, I am fairly sure the mapping is correct and working.
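
A small helper along these lines can print which physical pages back a run of logical pages, which is essentially how the list above was produced (same assumed binary map-file format as in the userspace read sketch earlier):

/* Print the physical 16 KB page behind each of `count` logical pages starting
 * at `start`, using the assumed binary map file (one uint64 per LBA). */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 4) { fprintf(stderr, "usage: %s mapfile start_lba count\n", argv[0]); return 1; }
    FILE *mf = fopen(argv[1], "rb");
    if (!mf) { perror("fopen"); return 1; }

    uint64_t start = strtoull(argv[2], NULL, 0);
    uint64_t count = strtoull(argv[3], NULL, 0);

    fseeko(mf, start * sizeof(uint64_t), SEEK_SET);
    for (uint64_t i = 0; i < count; i++) {
        uint64_t phys;
        if (fread(&phys, sizeof phys, 1, mf) != 1) break;
        printf("logical %llu -> physical %llu\n",
               (unsigned long long)(start + i), (unsigned long long)phys);
    }
    fclose(mf);
    return 0;
}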

Still, the problem is that for some LBAs I have more than one potential physical address the logical block could point to. There are none in the example above, but I find 700k or so while creating the physical-to-logical mapping.

As an example, LBA 12345 might be mentioned 5 times in the LBA identification blocks scattered around various superblocks. So I end up having to choose between these 5 possible physical addresses - at some point in the past each of them held LBA 12345, but I don't know which is the latest.
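
One way to make the ambiguity explicit is to collect every candidate instead of keeping only the first hit, e.g. by feeding the 'LBA physical-page' pairs from the parser sketch above into something like this (a sketch only; the pair-per-line input format and the candidate cap are assumptions):

/* Read "lba phys" pairs from stdin and report every LBA that has more than one
 * candidate physical page.  The table needs roughly 700 MB of RAM with this cap. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define LBA_COUNT 16777216u
#define MAX_CAND  4                                       /* arbitrary cap for this sketch */

struct cand { uint16_t n; uint64_t phys[MAX_CAND]; };

int main(void)
{
    struct cand *tab = calloc(LBA_COUNT, sizeof *tab);
    if (!tab) { perror("calloc"); return 1; }

    unsigned long long lba, phys;
    while (scanf("%llu %llu", &lba, &phys) == 2) {
        if (lba >= LBA_COUNT) continue;
        struct cand *c = &tab[lba];
        if (c->n < MAX_CAND) c->phys[c->n] = phys;        /* keep the first few candidates */
        c->n++;                                           /* but count them all            */
    }

    uint64_t dups = 0;
    for (uint64_t i = 0; i < LBA_COUNT; i++) {
        if (tab[i].n <= 1) continue;
        dups++;
        printf("LBA %llu: %u candidates:", (unsigned long long)i, (unsigned)tab[i].n);
        for (unsigned j = 0; j < tab[i].n && j < MAX_CAND; j++)
            printf(" %llu", (unsigned long long)tab[i].phys[j]);
        printf("\n");
    }
    fprintf(stderr, "%llu LBAs with more than one candidate\n", (unsigned long long)dups);
    free(tab);
    return 0;
}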

One intriguing thing is that after the list of 128 LBAs in the LBA identification block there is another list of 32-bit values, which are in the range of physical or logical block addresses. This list varies in length: sometimes there are very few entries, sometimes 128 32-bit words. However, the last LBA identification block in each superblock never has this list. I did wonder if it might be a list of physical addresses that the current block supersedes, but if that were the case, why would the last LBA identification block never contain it? Because of this I have not yet bothered trying to use it to identify stale LBAs, but I probably should code it up and see whether it resolves many of the duplicates. It also shares the 128-bit validity header I mentioned in the previous post.

Cheers, Pete


Elmar

Hello,

unfortunately, I had no time to look at your files. Maybe next week.

Best regards
Elmar

peterpion

Hi Elmar,

No worries, I appreciate you looking. I have made some progress since I last posted: I can now mount the disk, but I am unable to descend into some folders (because, I assume, their directory data lives in the wrong blocks).

BTW, I am still trying to find the drive's main translation table, but in case I don't find it, I wonder if there is an easier way.

My idea is to modify hfsprescue to let it choose the candidate blocks that make sense to it, i.e. load it with an array (which I have already made) of the alternative data blocks, so it can try them one by one and see which 'makes sense'.

You see, the problem is that for about 4% of the sectors on the drive I have several possible blocks of data that might be the right one, and I don't know which to choose. For instance, sector 724557 (as the Linux OS sees it) might actually be physical block 1298322, 932022 or 331922 in my image file. I have the 'spare' candidates saved in a different place. The 'wrong' copies will hold out-of-date data or pieces of other files, so in some cases I am sure it will not be possible to tell right from wrong, but if I can do it for the file structure (directory information) that would be great.

The reason there are spare copies at all is that the drive can physically store 274 GB of raw data but normally only exposes 256 GB of it to the operating system.

Anyway, as the author of hfsprescue, I wonder if you could say whether this would be possible. If not, it would save me the time of digging through the code and attempting the modifications; but if you say it is possible and give me a couple of pointers as to where to make the changes, I will work on that instead of on finding the translation table, since the latter is proving very difficult.

Thanks again, Pete