USDoD
Apr 1, 2024
In April 2024, a large trove of data made headlines for reportedly exposing "3 billion people" through a breach of the National Public Data background check service. The initial corpus released in the breach contained billions of rows of personal information, including US Social Security numbers. Further partial data sets were released later, including extensive personal information and 134M unique email addresses, although the origin and accuracy of the data remain in question.
Data found in this dataset
Source files
Each file below lists its column headers and the LLM's field-mapping reasoning, recorded during ingestion.
ssn2_ab · 14 columns · 370,097,699 rows
File structure
Format: CSV · Delimiter: comma · Has header: no · Quote: "
| Source column | Mapped field | Confidence | LLM assessment |
|---|---|---|---|
| 1 | firstName | high | [1] values are common given names (SABRINA, CARRIE) |
| 2 | lastName | high | [2] values are surnames (BIANCHI, MIDDLETON) |
| 3 | middleName | high | [3] values are middle names (LYNN, LEE), position between first and last name columns |
| 5 | dob | high | [5] values match YYYYMMDD date pattern (19710113, 19780309) |
| 6 | address1 | high | [6] values are street addresses with house numbers and street names |
| 7 | city | high | [7] values are city names (OCEAN SPRINGS, JACKSONVILLE, CHICAGO) |
| 9 | state | high | [9] values are 2-letter US state abbreviations (MS, AR, IL, AE) |
| 10 | zip | high | [10] values are 5-digit ZIP codes (39564, 72076, 60645) |
| 11 | phone | high | [11] values are 10-digit phone numbers (4029325087, 7732622608) |
| 12 | fullName | high | [12] values are full names with middle initial (SABRINA N BIANCHI, CARRIE LEE BRUNK) |
| 13 | fullName | high | [13] alternate full name variant (SABRINA LYNN DEMEMBER, CARRIE LEE) — alias/maiden name common in background check data |
| 14 | fullName | medium | [14] third full name variant (CARRIE M BRUNK) — sparse but contains real full name PII |
| 15 | dob | medium | [15] values are YYYYMM format (199505, 201807) — partial date, likely year+month of a significant date in background check context |
| 19 | ssn | high | [19] 9-digit numbers (594481480, 320788124) consistent with SSN format; NPD breach is known to contain SSNs |
Notes: 20 columns total, no header row detected — data appears headerless. 14 contain PII. Column 0 contains sequential numeric IDs (skip). Columns 4, 8, 16, 17, 18 are empty. Multiple fullName columns (12–14) represent name aliases/variants typical of background check aggregator data. Column 15 contains YYYYMM partial dates of uncertain purpose but mapped as dob. Column 19 contains 9-digit SSNs consistent with the National Public Data breach profile.
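Because this file has no header row, fields can only be identified by position. The following is a minimal sketch of that positional mapping, assuming 0-based column indices as used in the table above. The `COLUMN_MAP` dictionary and the `map_row` helper are hypothetical names introduced for illustration, not part of the ingestion pipeline, and the sample row is entirely synthetic.

```python
import csv
import io

# Positional mapping read off the table above (0-based column index -> field).
# Indices not listed (0, 4, 8, 16, 17, 18) are the ID column and the empty columns.
COLUMN_MAP = {
    1: "firstName", 2: "lastName", 3: "middleName", 5: "dob",
    6: "address1", 7: "city", 9: "state", 10: "zip", 11: "phone",
    12: "fullName", 13: "fullName", 14: "fullName", 15: "dob", 19: "ssn",
}


def map_row(row):
    """Collect non-empty values per mapped field; repeated fields become lists."""
    record = {}
    for index, field in COLUMN_MAP.items():
        value = row[index].strip() if index < len(row) else ""
        if value:
            record.setdefault(field, []).append(value)
    return record


# Synthetic 20-column row (placeholder values only, no real data).
sample = (
    '1,JANE,DOE,Q,,19700101,"1 EXAMPLE ST",SPRINGFIELD,,IL,00000,'
    "0000000000,JANE Q DOE,JANE DOE,,197001,,,,000000000\n"
)
rows = list(csv.reader(io.StringIO(sample), delimiter=",", quotechar='"'))
record = map_row(rows[0])
```

Keeping every value of a repeated field (here `fullName` and `dob`) in a list preserves the alias and partial-date columns without overwriting the primary value.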
ssn_aa · 17 columns · 65,100,000 rows
File structure
Format: CSV · Delimiter: comma · Has header: yes · Quote: "
| Source column | Mapped field | Confidence | LLM assessment |
|---|---|---|---|
| 1 | firstName | high | [1] header 'firstname', values are uppercase given names like 'AURETTA', 'JUNE' |
| 2 | lastName | high | [2] header 'lastname', values are uppercase surnames like 'TERRY' |
| 3 | middleName | high | [3] header 'middlename', values are middle names/initials like 'JUNE', 'A' |
| 4 | suffix | high | [4] header 'name_suff', name suffix field |
| 5 | dob | high | [5] header 'dob', value '19461201' matches YYYYMMDD date of birth pattern |
| 6 | address1 | high | [6] header 'address', values are street addresses like '6530 DONNA DR' |
| 7 | city | high | [7] header 'city', values are city names like 'ANCHORAGE' |
| 9 | state | high | [9] header 'st', values are 2-letter US state codes like 'AK' |
| 10 | zip | high | [10] header 'zip', values are 5-digit US postal codes like '99504' |
| 11 | phone | high | [11] header 'phone1', phone number field |
| 12 | fullName | high | [12] header 'aka1fullname', full name alias/AKA field |
| 13 | fullName | high | [13] header 'aka2fullname', second full name alias/AKA field |
| 14 | fullName | high | [14] header 'aka3fullname', third full name alias/AKA field |
| 16 | dob | high | [16] header 'alt1DOB', alternate date of birth field |
| 17 | dob | high | [17] header 'alt2DOB', second alternate date of birth field |
| 18 | dob | high | [18] header 'alt3DOB', third alternate date of birth field |
| 19 | ssn | high | [19] header 'ssn', values are 9-digit numbers like '574182899' consistent with US Social Security Numbers |
Notes: 20 columns total; 17 are mapped as PII. Column 0 (ID) is an internal record identifier and is skipped. Column 8 (county_name) is a geographic/administrative subdivision, not a standard PII field, and is skipped. Column 15 (StartDat) appears to be a timestamp/date flag and is skipped. Columns 12–14 (aka1–3fullname) are AKA/alias full names, mapped as fullName because they contain searchable personal identity data. Columns 16–18 (alt1–3DOB) are alternate DOBs, mapped as dob. SSNs appear without hyphens (9 raw digits).
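The dob columns in these files hold compact digit strings: full dates as YYYYMMDD and, in one headerless file, partial dates as YYYYMM. A minimal sketch of normalizing both shapes might look like the following. The function name `parse_compact_date` is hypothetical, and returning `None` for anything that does not parse is an assumption made for this example, not documented pipeline behavior.

```python
from datetime import datetime


def parse_compact_date(value):
    """Return an ISO string for YYYYMMDD or YYYYMM input, or None if invalid."""
    value = value.strip()
    if not value.isdigit():
        return None
    try:
        if len(value) == 8:
            # Full date, e.g. a YYYYMMDD date of birth.
            return datetime.strptime(value, "%Y%m%d").date().isoformat()
        if len(value) == 6:
            # Partial date: only year and month are known.
            return datetime.strptime(value, "%Y%m").strftime("%Y-%m")
    except ValueError:
        # Impossible dates (month 13, February 31, ...) are rejected.
        return None
    return None
```

Dispatching on length first avoids relying on parser failures to tell the two formats apart, and keeps partial dates visibly partial instead of inventing a day of the month.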
ssn_ab · 15 columns · 373,698,020 rows
File structure
Format: CSV · Delimiter: comma · Has header: no · Quote: "
| Source column | Mapped field | Confidence | LLM assessment |
|---|---|---|---|
| 1 | firstName | high | [1] no header (headerless file), values are all-caps given names: JOHN, KAREN |
| 2 | lastName | high | [2] no header, values are all-caps surnames: TRACEY, TREACY |
| 3 | middleName | high | [3] no header, single-letter values (S, A) consistent with middle initials |
| 5 | dob | high | [5] no header, 8-digit values in YYYYMMDD format: 19210410, 19680731 |
| 6 | address1 | high | [6] no header, values are street addresses: 157 SERGEANTSVILLE RD, 13 ROBIN RD |
| 7 | city | high | [7] no header, values are city names: DEMAREST, WEST CALDWELL, DOVER |
| 8 | address2 | medium | [8] no header, values are county names (BERGEN, ESSEX, MORRIS); no county field available, mapped to address2 as geographic subdivision |
| 9 | state | high | [9] no header, 2-letter US state abbreviations: NJ |
| 10 | zip | high | [10] no header, 5-digit US ZIP codes: 08822, 07627, 07006 |
| 12 | firstName | medium | [12] no header, sparse values appear to be given names or surnames (Tracy, Caggiano, Treece); inconsistent mix, likely alternate name field |
| 13 | fullName | high | [13] no header, values are full names with spaces: ' John', ' Karen A', ' Dan W' |
| 14 | fullName | high | [14] no header, values are reversed full names: 'Caggiano Karen' |
| 16 | dob | high | [16] no header, 8-digit YYYYMMDD dates: 19680731, 19600119 — alternate/duplicate DOB field |
| 17 | dob | high | [17] no header, 8-digit YYYYMMDD dates: 19680731 — third DOB variant field |
| 19 | ssn | high | [19] no header, 9-digit numeric values (131054158, 154644580) consistent with US Social Security Numbers; matches NPD breach context |
Notes: File appears headerless (row 0 contains data values, not column labels). 20 columns total; column [0] contains large sequential numeric IDs (500000000+) treated as internal record IDs. Columns [4], [11], [15], [18] are empty/null. Three separate DOB columns ([5], [16], [17]) suggest denormalized or deduplicated source records. Column [19] 9-digit numbers strongly indicate SSNs given National Public Data breach context. County data in [8] mapped to address2 as no county field is available.
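The notes for each file account for all 20 columns as either mapped, skipped, or empty. As a rough consistency check, the sketch below verifies that the index sets listed in the three tables and their notes partition the range 0–19. The file names are as read from the headings, and `FILES` and `check_partition` are hypothetical names introduced only for this example.

```python
TOTAL_COLUMNS = 20

# 0-based column indices transcribed from the tables and notes above.
# "unmapped" combines the ID column, skipped columns, and empty columns.
FILES = {
    "ssn2_ab": {
        "mapped": {1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 19},
        "unmapped": {0, 4, 8, 16, 17, 18},
    },
    "ssn_aa": {
        "mapped": {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19},
        "unmapped": {0, 8, 15},
    },
    "ssn_ab": {
        "mapped": {1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 17, 19},
        "unmapped": {0, 4, 11, 15, 18},
    },
}


def check_partition(mapped, unmapped, total=TOTAL_COLUMNS):
    """True if the two sets are disjoint and together cover exactly 0..total-1."""
    return not (mapped & unmapped) and (mapped | unmapped) == set(range(total))


results = {name: check_partition(**cols) for name, cols in FILES.items()}
mapped_counts = {name: len(cols["mapped"]) for name, cols in FILES.items()}
```

The mapped counts produced this way can be compared against the column count shown in each file heading.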