Amcache FileId explained: the SHA-1 hash format Windows stores

The FileId value in Root\InventoryApplicationFile is one of the most useful fields in the entire Amcache hive, and one of the most misunderstood. It is the file's content hash, but it is stored with a prefix rather than as a bare SHA-1 digest, and for large files it does not cover the whole file. This post is the full reference: what the value is, how to use it, and the traps that catch new analysts.

For the broader Amcache reference, see the Amcache complete reference; for the surrounding registry structure, see Amcache registry structure.


What the value looks like

A typical FileId from a real hive:

0000da39a3ee5e6b4b0d3255bfef95601890afd80709

44 characters total:

  • The first 4 characters are always "0000" — a fixed type tag.
  • The remaining 40 characters are the file's SHA-1 hex digest.

The "0000" prefix is a historical artefact: early versions of the appraiser anticipated multiple hash algorithms (with each prefix indicating which) but in practice only SHA-1 was ever used. Today the prefix is constant.

When AmcacheParser exposes this field in its CSV, it splits it into two columns:

Column    Value
FileId    The full 44-character string, prefix included.
Hash      Just the 40 hex characters: the SHA-1 alone.

Always use Hash (or strip the prefix yourself) when joining against external hash feeds. VirusTotal, your TI feeds, and hash-allowlist databases expect a 40-char SHA-1 — they will not match anything if you include the "0000" prefix.
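
If you strip the prefix yourself, a small helper keeps malformed values out of external lookups. The sketch below assumes the layout described above ("0000" followed by 40 hex characters); the function name `bare_sha1` is hypothetical and is not part of any parser.

```python
import re

# Assumption: a FileId is "0000" followed by 40 hex characters.
_FILE_ID_RE = re.compile(r"0000([0-9a-fA-F]{40})")
_SHA1_RE = re.compile(r"[0-9a-fA-F]{40}")


def bare_sha1(value: str) -> str:
    """Return the lowercase 40-char SHA-1 from a FileId or an already-bare hash.

    Raises ValueError for anything else, so a malformed value never
    reaches a hash feed looking like a valid digest.
    """
    value = value.strip()
    match = _FILE_ID_RE.fullmatch(value)
    if match:
        return match.group(1).lower()
    if _SHA1_RE.fullmatch(value):
        return value.lower()
    raise ValueError(f"not a FileId or SHA-1: {value!r}")
```

Accepting both forms means the same helper can sit in front of every lookup, whichever column the value came from.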


What it actually hashes

This is the trap that catches almost every new Amcache analyst:

Amcache's SHA-1 hashes the first 31,457,280 bytes (30 MiB, about 31.4 MB) of the file, not the whole file.

For files at or below that size (which is almost everything; most EXEs and DLLs are well under), the prefix hash equals the whole-file SHA-1. The two values are identical.

For files larger than 30 MiB, the Amcache hash is a prefix hash, not a full-content hash. It is still distinctive enough to identify a specific build of a specific binary, but it is not the same value you would get from sha1sum on the whole file.

Why this matters

  • VirusTotal matches. For files of 30 MiB or less, the Amcache SHA-1 matches the SHA-1 VirusTotal indexes. For larger files (installers, some game binaries, large enterprise software) it will not match, and a VirusTotal "no record" response is meaningless.
  • Custom hash databases. If you maintain an internal allowlist of known-good hashes, make sure you are storing the same kind of hash you'll compare against. Either store full-content SHA-1s (and accept that large-binary comparisons against Amcache will fail) or maintain a parallel prefix-hash column.
  • Recomputing for verification. If you have the original binary on hand and want to verify that an Amcache hash matches, hash only the first 30 MiB:
import hashlib

PREFIX_BYTES = 30 * 1024 * 1024  # 31,457,280 bytes (30 MiB)

def amcache_sha1(path: str) -> str:
    """SHA-1 of the first PREFIX_BYTES of the file, as Amcache records it."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        h.update(f.read(PREFIX_BYTES))
    return h.hexdigest()
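
For the parallel prefix-hash column suggested in the allowlist bullet above, both digests can be computed in one streaming pass. This is a minimal sketch: the function name `full_and_prefix_sha1` is hypothetical, and `prefix_bytes` is left as a required argument so you pass whatever Amcache cutoff you are working with.

```python
import hashlib


def full_and_prefix_sha1(path, prefix_bytes, chunk_size=1024 * 1024):
    """Return (full_sha1, prefix_sha1) for a file, reading it once in chunks.

    prefix_sha1 covers only the first prefix_bytes of the file; for files
    no larger than prefix_bytes the two digests are the same.
    """
    full = hashlib.sha1()
    prefix = hashlib.sha1()
    remaining = prefix_bytes
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            full.update(chunk)
            if remaining > 0:
                head = chunk[:remaining]  # only the part still inside the prefix
                prefix.update(head)
                remaining -= len(head)
    return full.hexdigest(), prefix.hexdigest()
```

Storing both values lets one table serve Amcache comparisons (prefix column) and ordinary hash feeds (full column) without re-reading large binaries.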

Real-world traps

A handful of pitfalls that come up on real cases:

Don't include the "0000" prefix in lookups

# Wrong
search_virustotal('0000da39a3ee5e6b4b0d3255bfef95601890afd80709')
 
# Right
search_virustotal('da39a3ee5e6b4b0d3255bfef95601890afd80709')

VirusTotal's API expects the bare hash. A lookup with the prefix attached will not match any file, and in a script that failure is easy to mistake for "this file is unknown", which is far more misleading than an obvious error.

Hash collisions are unlikely but not impossible

SHA-1 has practical collision attacks: someone who controls both files can craft two with the same digest. Forging a file that matches the hash of a specific, existing Amcache entry is a different and much harder problem (a second-preimage attack), and no practical attack of that kind is known, so in most casework this is irrelevant. But for high-confidence matching in a high-stakes investigation, do not treat a SHA-1 match as cryptographic identity. Pair with file size, link date, and at least one other field.
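
One way to make that pairing routine is to require agreement on extra fields before calling two records the same file. This is a sketch only: `Hash` and `Size` appear elsewhere in this post, while the `LinkDate` and `ProductVersion` column names and both function names are assumptions for illustration.

```python
def corroborating_fields(amcache_row, reference,
                         fields=("Size", "LinkDate", "ProductVersion")):
    """Return the names of the extra fields on which both records agree."""
    agree = []
    for name in fields:
        a, b = amcache_row.get(name), reference.get(name)
        # Ignore fields that are missing or empty on the Amcache side.
        if a not in (None, "") and str(a).strip() == str(b).strip():
            agree.append(name)
    return agree


def is_confident_match(amcache_row, reference, min_extra=2):
    """True only if the SHA-1s match AND at least min_extra other fields agree."""
    h1 = (amcache_row.get("Hash") or "").lower()
    h2 = (reference.get("Hash") or "").lower()
    if not h1 or h1 != h2:
        return False
    return len(corroborating_fields(amcache_row, reference)) >= min_extra
```

The threshold of two extra fields mirrors the advice above; tighten or loosen it to suit the case.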

Don't trust an IsPeFile = False row's FileId

Amcache occasionally records FileId values for non-PE files inventoried by the appraiser. The hash is still real, but the context is different — it is hashing whatever the file is (a script, a config file), and downstream tools that assume PE-file context (VirusTotal's PE-aware searches, Yara rules against PE bytes) will return less useful results.

Multiple rows, same hash

If you find the same Hash value across multiple *_UnassociatedFileEntries.csv rows on the same host, that is meaningful. It means the same binary content was inventoried at multiple paths. Common reasons:

  • The attacker copied a tool into several locations to test which one would execute.
  • A legitimate installer dropped the same DLL into multiple product directories.
  • A user copied a file around manually.

Cluster by Hash, then look at the FullPath set and KeyLastWriteTimestamp for each instance. The timestamps approximate the order in which the copies were inventoried; the paths hint at the intent.
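
The clustering step can be sketched in a few lines of Python. This assumes the rows come from `csv.DictReader` over the CSV, with the `Hash`, `FullPath`, and `KeyLastWriteTimestamp` columns used in this post, and that timestamps are in a format that sorts correctly as text (for example `YYYY-MM-DD HH:MM:SS`). The function name `paths_by_hash` is hypothetical.

```python
from collections import defaultdict


def paths_by_hash(rows):
    """Group rows by Hash and keep only hashes seen at more than one path.

    Returns {hash: [(timestamp, path), ...]} with each list sorted by timestamp.
    """
    groups = defaultdict(list)
    for row in rows:
        h = (row.get("Hash") or "").lower()
        if h:
            groups[h].append((row["KeyLastWriteTimestamp"], row["FullPath"]))
    return {
        h: sorted(entries)
        for h, entries in groups.items()
        # Compare paths case-insensitively, as Windows does.
        if len({path.lower() for _, path in entries}) > 1
    }
```

Feed it `list(csv.DictReader(f))` for one host's CSV and read each cluster top to bottom.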

Multiple hashes, same FullPath

The opposite pattern — the same path with different Hash values across multiple rows — means the binary at that path changed between inventories. This is a strong signal for binary replacement:

  • Legitimate: a software update overwrote the file.
  • Suspicious: an attacker replaced a system binary or a regularly-run user tool with a trojanised copy.

Sort the rows by KeyLastWriteTimestamp to see when each new hash appeared, then correlate with patch events or Sysmon FileCreate events (event ID 11) around those times.
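
A rough sketch of that sort-and-compare step, again assuming `csv.DictReader` rows with the `Hash`, `FullPath`, and `KeyLastWriteTimestamp` columns and text-sortable timestamps. The function name `hash_changes_by_path` is hypothetical.

```python
def hash_changes_by_path(rows):
    """Return {path: [(timestamp, hash), ...]} for paths whose hash changed.

    Each list holds one entry per change, in timestamp order, so
    consecutive entries always carry different hashes.
    """
    history = {}
    for row in sorted(rows, key=lambda r: r["KeyLastWriteTimestamp"]):
        h = (row.get("Hash") or "").lower()
        if not h:
            continue
        path = row["FullPath"].lower()  # Windows paths are case-insensitive
        timeline = history.setdefault(path, [])
        if not timeline or timeline[-1][1] != h:
            timeline.append((row["KeyLastWriteTimestamp"], h))
    return {p: t for p, t in history.items() if len(t) > 1}
```

Each timestamp in the result is a candidate time window to check against patch logs.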


Pivots that use FileId / Hash

The pivots that earn their keep on real cases:

Cross-host hash hunting

# Pivot a known-bad SHA-1 across every host's Amcache CSV
# (assumes each file name starts with the host name, e.g. HOST_..._UnassociatedFileEntries.csv)
$badHash = 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
Get-ChildItem -Recurse -Filter *_UnassociatedFileEntries.csv |
  ForEach-Object {
    $file = $_   # keep the file object: inside Select-Object, $_ is the CSV row
    Import-Csv $file.FullName |
      Where-Object { $_.Hash -eq $badHash } |
      Select-Object @{n='Host';e={$file.Name.Split('_')[0]}},
                    FullPath, KeyLastWriteTimestamp, Size
  } |
  Sort-Object Host

This is how you go from "we found this hash on one host" to "tell me every host in the environment that has ever had this binary present, and when it appeared."

VirusTotal enrichment of a CSV

import csv
import time

import requests

API = 'https://www.virustotal.com/api/v3/files/'
HEADERS = {'x-apikey': '<your-key>'}

seen = set()
with open('HOST_amcache_UnassociatedFileEntries.csv', newline='') as f:
    for row in csv.DictReader(f):
        h = (row.get('Hash') or '').strip().lower()
        if not h or h in seen:
            continue  # skip empty values and hashes already looked up
        seen.add(h)
        r = requests.get(API + h, headers=HEADERS, timeout=30)
        if r.status_code == 200:
            stats = r.json()['data']['attributes']['last_analysis_stats']
            if stats.get('malicious', 0) > 0:
                print(h, stats, row['FullPath'])
        # Any other status (404 = no record, 401/429 = key or quota problem) is skipped
        time.sleep(15)  # VT public API rate limit: 4 requests per minute

Even at the public API's rate limit, this usually produces a short list of hashes that at least one engine flags as malicious on a typical infected host.

Correlating with Sysmon Image Loaded events

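Sysmon event ID 7 (Image Loaded) records the hashes of each module a process loads, provided image-load logging is enabled in the Sysmon configuration (it is off by default) and SHA1 is among the configured hash algorithms. Joining Amcache Hash to the SHA1 value in event 7's Hashes field tells you which processes loaded a given attacker DLL, and when. Sysmon hashes the whole file, so the join only works for files small enough that the Amcache hash is also the whole-file SHA-1.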
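
A sketch of that join, assuming the Sysmon events have already been exported to dictionaries keyed by the event's field names (`UtcTime`, `Image`, `ImageLoaded`, `Hashes`) and that `Hashes` uses the `NAME=VALUE,NAME=VALUE` layout. How you export the events is up to you; both function names here are hypothetical.

```python
def parse_sysmon_hashes(field):
    """Turn 'SHA1=AB..,MD5=CD..' into {'sha1': 'ab..', 'md5': 'cd..'}."""
    out = {}
    for part in field.split(","):
        name, sep, value = part.partition("=")
        if sep:  # ignore fragments without a NAME= prefix
            out[name.strip().lower()] = value.strip().lower()
    return out


def loads_of_amcache_hashes(amcache_rows, sysmon_events):
    """Return sorted (UtcTime, Image, ImageLoaded, sha1) tuples for every
    image-load event whose SHA-1 also appears in the Amcache rows."""
    wanted = {(r.get("Hash") or "").lower() for r in amcache_rows}
    wanted.discard("")
    hits = []
    for ev in sysmon_events:
        sha1 = parse_sysmon_hashes(ev.get("Hashes", "")).get("sha1")
        if sha1 in wanted:
            hits.append((ev["UtcTime"], ev["Image"], ev["ImageLoaded"], sha1))
    return sorted(hits)
```

Lowercasing both sides matters because the two sources do not necessarily agree on hex case.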


See also

Want to see the FileId values in your own hive without installing anything? Drop a hive on the parser home page — it parses entirely in your browser.
