
File type detection

I was writing an article yesterday when I decided to browse the images that I had prepared to see what order to put them in.

File not found.

Huh?

I refreshed the browser and saw that the contents of my SD card had been scrambled, blended, and otherwise messed with.
Since nothing had been written to the card, I figured that the data ought to be recoverable. I went to DOS (well, NTVDM) and did chkdsk e: /f and let it get on with the job. Now chkdsk can often "recover" a damaged filesystem, but it does it with no intelligence whatsoever. Damaged things are not recovered, they are just hacked out.
So it created eighty billion .CHK files from the "lost fragments".

Most of these 'fragments' are valid files (and since nothing was written to the drive, they should not be corrupt). The only issue is that chkdsk doesn't know the original file lengths, so each file is rounded up to a whole number of filesystem clusters (multiples of 32KiB on this card). Whether the extra junk on the end is a problem depends on the filetype.
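
To put a number on that rounding, here is a minimal Python sketch (it is not part of the VB program further down). It assumes a 32KiB cluster, and the name padded_size is simply made up for the illustration.

```python
CLUSTER = 32 * 1024  # assumed cluster size in bytes (32KiB)

def padded_size(real_size, cluster=CLUSTER):
    """Size of the .CHK file that would hold a file of real_size bytes."""
    clusters = -(-real_size // cluster)  # ceiling division
    return clusters * cluster

# A 100,000 byte photo needs four clusters, so the recovered file is
# 131,072 bytes long and the last 31,072 bytes are leftover padding.
print(padded_size(100_000))
```

Formats with their own end marker, such as JPEG and PNG, tend not to mind the extra bytes. Plain text (subtitles, say) will show the junk at the end if you open it in an editor.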

So far I've recovered some videos from my phone, all of the originals of the photos of the tram-train trip to Nantes with Mick, and some other random PDFs and such that were not backed up to harddisc. I'm less concerned about the animé as I had copied the important stuff to harddisc.

I'm not feeling inclined to examine what may be thousands of files to see what they are, especially as there is no guarantee that the fragments are even valid files. Bits and pieces of leftover files will turn up, so don't panic if you see files containing gibberish or bits of files that you recognise.
Anyway, not wanting to wade through all the rubbish, I decided to throw together a program to do the grunt work.

Here it is. A wodge of Visual Basic to look at all the .CHK files and try to work out what sort of file each one is. The path is hardwired to E:\FOUND.000\ as that's what it was on my setup. Amend this as necessary. In hindsight, I probably should have used a "BasePath$" string or something. Oh well...

Option Explicit

Private Type FilesDef
  FileName As String            ' the filename
  Exten As String               ' what to rename it as
End Type


Private Sub Form_Load()
  ' Allocate space for 10000 elements. chkdsk won't deal with more on one run.
  ' (if there are more files, you can rename and move but DO NOT WRITE as it
  '  risks overwriting files that COULD be recovered)
  
  Dim Files(10000) As FilesDef ' it's Windows, stupid memory claims are the norm! :-P

  Dim Entity As String
  Dim ThisCount As Integer
  Dim FileCount As Integer
  Dim YieldCount As Integer
  Dim WordA As Long
  Dim WordB As Long
  Dim WordC As Long
  Dim WordG As Long
  Dim MyFP As Integer
  Dim OldName As String
  Dim NewName As String
  
  ' Force window open
  Me.doing.Caption = "Initialising..."
  Me.Show
  DoEvents
  Me.Refresh
  DoEvents
  
  ' Scan loop
  FileCount = 1
  YieldCount = 1
  Entity = Dir("E:\FOUND.000\*.CHK") ' **HARDWIRED** to the path of my SD card
                                     '               Tweak as appropriate.
  Do While (Len(Entity) > 0)
    ' Strip OUT directories...
    If (GetAttr("E:\FOUND.000\" & Entity) And vbDirectory) <> vbDirectory Then
      ' If there is an Entity, write it to array...
      Files(FileCount).FileName = Entity
      
      ' Now open it up and read the first three words (3 x 4 bytes) and
      ' the word at +24.
      MyFP = FreeFile
      Open "E:\FOUND.000\" & Entity For Binary Access Read Lock Read As MyFP
      Get #MyFP, 1, WordA
      Get #MyFP, 5, WordB
      Get #MyFP, 9, WordC
      Get #MyFP, 25, WordG
      Close #MyFP
      
      Me.doing.Caption = "Examining file '" & Entity & "' (" & Str(FileCount) & ")..."
      
      ' Now attempt to guess its type - note the backwards byte order
      ' This list is NOT complete (no 7zip, no WMF, etc). These types are simply
      ' the ones that I expect to be present on my SD card...
      
      
      ' Image file types
      
      ' JPEG files begin "ÿØÿà" which is 0xFFD8FFE0 / 0xFFD8FFE1
      If (WordA = &HE0FFD8FF) Or (WordA = &HE1FFD8FF) Then _
        Files(FileCount).Exten = "jpeg"
      
      ' BMP images begin "BMxx x0xx" which is 0x424Dxxxx xx00xxxx
      If (((WordA And &HFFFF&) = &H4D42&) And ((WordB And &HFF00&) = &H0&)) Then _
        Files(FileCount).Exten = "bmp"
      
      ' GIF files begin "GIF8" which is 0x47494638
      If (WordA = &H38464947) Then Files(FileCount).Exten = "gif"
      
      ' PNG images begin with byte 0x89 then "PNG", which is 0x89504E47
      If (WordA = &H474E5089) Then Files(FileCount).Exten = "png"
      
      
      ' Video and audio file types
      
      ' MP4 files with "ftyp" in second word (0x66747970)
      If (WordB = &H70797466) Then Files(FileCount).Exten = "mp4"
      
      ' AVI files have "AVI " in third word (0x41564920)
      ' (can't rely upon "RIFF" in first word as WAV files begin with it too)
      If (WordC = &H20495641) Then Files(FileCount).Exten = "avi"

      ' Matroska files have "matr" in seventh word (0x6D617472)
      ' (usually, I have seen some that differ)
      If (WordG = &H7274616D) Then Files(FileCount).Exten = "mkv"
      
      ' FLV files begin "FLV[" which is 0x464C5601
      If (WordA = &H1564C46) Then Files(FileCount).Exten = "flv"
      
      ' SRT subs begin "1<newline>0" which is 0x310D0A30
      If (WordA = &H300A0D31) Then Files(FileCount).Exten = "srt"
      
      ' MP3 files with ID3 data begin "ID3[" which is 0x49443303
      If (WordA = &H3334449) Then Files(FileCount).Exten = "mp3"
      
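      ' MP3 files without ID3 tags *appear* to begin:
      '   FF F3 xx xx 00 00 xx xx
      '   FF FB xx xx 00 00 xx xx
      ' The FF Fx is the MPEG audio frame sync (eleven set bits) followed by
      ' the version and layer bits, but that is too little to match on
      ' without risking false positives, so leave it.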
      
      
      ' Archive file types
      
      ' Zip files begin "PK[]" which is 0x504B0304
      If (WordA = &H4034B50) Then Files(FileCount).Exten = "zip"
      
      ' RAR archives begin "Rar!" which is 0x52617221
      If (WordA = &H21726152) Then Files(FileCount).Exten = "rar"
      
      
      ' Other file types
      
      ' PDF documents begin "%PDF" which is 0x25504446
      If (WordA = &H46445025) Then Files(FileCount).Exten = "pdf"
            
      ' HTML begins "<htm" or "<HTM" which is 0x3C68746D / 0x3C48544D
      If (WordA = &H6D74683C) Or (WordA = &H4D54483C) Then _
        Files(FileCount).Exten = "html"
      ' there's also the doctype version, too...

      ' Executables begin "MZ[] " which is 0x4D5A9000
      If (WordA = &H905A4D) Then Files(FileCount).Exten = "exe"
      
      ' RISC OS DrawFiles begin "Draw" which is 0x44726177
      If (WordA = &H77617244) Then Files(FileCount).Exten = "aff"


      ' In case anything else needs to be trapped
      'If (Files(FileCount).Exten = "") Then
      '  Debug.Print Entity, Hex(WordA), Hex(WordB), Hex(WordC)
      'End If
      
      FileCount = FileCount + 1
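      ' (the Error statement wants an error number, so raise with a description)
      If (FileCount > 10000) Then Err.Raise vbObjectError + 513, , "Too many files."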
    End If ' is NOT a directory
  
    ' Yield?
    YieldCount = YieldCount + 1
    If (YieldCount > 150) Then
      DoEvents
      YieldCount = 0
    End If
    
    ' Get next entity
    Entity = Dir
  Loop
  
  ' Now we have enumerated all of the files, time to start renaming them.
  ThisCount = 1
  Do While (ThisCount < FileCount)
    If (Files(ThisCount).Exten <> "") Then
      ' FileCount is one more than the number of files found
      Me.doing.Caption = "Renaming file " & Str(ThisCount) & _
        " of " & Str(FileCount - 1) & "..."
  
      OldName = "E:\FOUND.000\" & Files(ThisCount).FileName ' original name
      NewName = Left(OldName, Len(OldName) - 3)             ' hack off "CHK"
      NewName = NewName & Files(ThisCount).Exten            ' add new extension
  
      Name OldName As NewName
    End If
  
    ' Yield?
    YieldCount = YieldCount + 1
    If (YieldCount > 150) Then
      DoEvents
      YieldCount = 0
    End If
  
    ThisCount = ThisCount + 1
  Loop
  
  ' We're done!
  Beep
  End
End Sub
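
The "backwards byte order" mentioned in the comments comes from reading four bytes of the file as a little-endian 32-bit value, which is what Get does when it fills a Long on a PC. Here is a small Python sketch of the same effect using the standard struct module; the header bytes are the JPEG ones from the listing.

```python
import struct

header = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # first four bytes of a JPEG

# "<I" is a little-endian unsigned 32-bit value, ">I" is big-endian
little = struct.unpack("<I", header)[0]
big = struct.unpack(">I", header)[0]

print(hex(little))  # 0xe0ffd8ff, the value the VB code compares against
print(hex(big))     # 0xffd8ffe0, the order you would read in a hex dump
```

VB's Long is signed, so &HE0FFD8FF is actually a negative number there, but the bit pattern is the same and the comparison still works.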

Same principle applies if you want to do the same sort of thing in BBC BASIC.
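
Or in anything else, for that matter. As a rough illustration, here is the same idea as a Python sketch. It is not a tested replacement for the VB program; the names SIGNATURES, guess_extension and rename_all are invented for this example, and the signatures are simply the ones from the listing above, written in file order so that no byte swapping is needed.

```python
from pathlib import Path

# (offset, magic bytes, extension)
SIGNATURES = [
    (0, b"\xff\xd8\xff\xe0", "jpeg"),
    (0, b"\xff\xd8\xff\xe1", "jpeg"),
    (0, b"GIF8", "gif"),
    (0, b"\x89PNG", "png"),
    (4, b"ftyp", "mp4"),
    (8, b"AVI ", "avi"),
    (24, b"matr", "mkv"),
    (0, b"FLV\x01", "flv"),
    (0, b"1\r\n0", "srt"),
    (0, b"ID3\x03", "mp3"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"Rar!", "rar"),
    (0, b"%PDF", "pdf"),
    (0, b"<htm", "html"),
    (0, b"<HTM", "html"),
    (0, b"MZ\x90\x00", "exe"),
    (0, b"Draw", "aff"),
]

def guess_extension(head):
    """Guess an extension from the first 28 bytes of a file, else None."""
    for offset, magic, ext in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return ext
    # BMP: "BM", then a file size whose top byte is zero (same test as the VB)
    if head[:2] == b"BM" and len(head) > 5 and head[5] == 0:
        return "bmp"
    return None

def rename_all(folder):
    """Rename every recognised .CHK file in folder (only renames, no writes)."""
    for path in sorted(Path(folder).glob("*.CHK")):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            head = f.read(28)
        ext = guess_extension(head)
        if ext:
            path.rename(path.with_suffix("." + ext))
```

You would call something like rename_all("E:/FOUND.000") to process the folder. One difference from the VB version: this sketch stops at the first signature that matches, whereas the VB code runs every test and the last match wins. For these particular signatures that should rarely matter.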

 

 

© 2016 Rick Murray
This web page is licenced for your personal, private, non-commercial use only. No automated processing by advertising systems is permitted.
RIPA notice: No consent is given for interception of page transmission.

 
