22 comments

  • dperfect 2 hours ago
    Nerdsnipe confirmed :)

    Claude Opus came up with this script:

    https://pastebin.com/ntE50PkZ

    It produces a somewhat-readable PDF (first page at least) with this text output:

    https://pastebin.com/SADsJZHd

    (I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

  • chrisjj 4 hours ago
    > it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

    Or worse. She did.

    • eek2121 4 hours ago
      I mean, the internet is finding all her mistakes for her. She is actually doing alright with this. Crowdsource everything, fix the mistakes. lol.
      • helterskelter 2 hours ago
        I wonder if this could be intentional. If the datasets are contaminated with CSAM, anybody with a copy is liable to be arrested for possession.

        More likely it's just an oversight, but it could also be CYA for dragging their feet, like "you rushed us, and look at these victims you've retraumatized". There are software solutions to find nudity and they're quite effective.

      • TSiege 3 hours ago
        This would be funnier if it wasn’t child porn being unredacted by our government
        • PetriCasserole 1 hour ago
          I can't believe what we've become.
          • queenkjuul 40 minutes ago
            Become?
          • nixosbestos 21 minutes ago
            Every second of my political consciousness in the United States has been acutely tinged with the awareness that a bunch of people, across most of the political spectrum live in a constant state of denial. Denial of personal responsibility or culpability. Denial of cognitive dissonance. Denial of any distinct, self-informed morals. Denial of anything but a fear of others. Denial of anything that makes them fearful or uncomfortable or might invite confrontation.

            I've known from the second I started doing debate and FX/DX in highschool, well, let's just say I never thought that the majority of the 2FA-folks would be worth a damn when tyranny really came knocking. Fear of the other as a form of manipulation, and a distraction from class consciousness, has been their literal raison d'état since decades before I was born.

            I guess I was shocked that the President being a convicted rapist and documented child predator would be a bridge too far. But then we re-elected him.

            I believe it. We voted for this. We do nothing in the face of zero actual justice. This is exactly as good as we deserve. And best of all, it certainly doesn't stop here. This is what they chose to not redact. When we know they spent enormous tax-payer hundreds-of-people hours redacting the documents.

            I don't think it's even conspiratorial to say they left stuff in, so they could use it as justification for not releasing the other HALF of the files that haven't been released, even overly censored.

            We deserve this, and the much worse that our apathy has invited.

            • hsuduebc2 5 minutes ago
              I will certainly feel less confident ridiculing conspiracy theories.

              I’d never believe Bill Gates would secretly slip antibiotics into his wife’s cocktail to treat an STI he got from a Russian prostitute on Epstein’s estate.

              But here we are.

      • chrisjj 3 hours ago
        Let's see her sued for leaking PII. Here in Europe, she'd be mincemeat.
        • ISL 2 hours ago
          The US administration is, at present, regularly violating the law and ignoring court orders. Indeed, these very releases are patently in violation of multiple federal laws -- they're simultaneously insufficiently-responsive to meet the requirements of the law requiring the release of the files and fall afoul of CSAM laws by being incompletely redacted.

          The challenge, as we're all experiencing together, is that the law is not inherently self-enforcing.

      • dagi3d 2 hours ago
        The issue is that these mistakes can't really be fixed: once they are discovered, it doesn't matter if they are eventually redacted.
      • rockskon 3 hours ago
        Yeah - they'll take these lessons learned for future batches of releases.
  • bawolff 3 hours ago
    Tesseract supports being trained for specific fonts; that would probably be a good starting point.

    https://pretius.com/blog/ocr-tesseract-training-data

  • pyrolistical 4 hours ago
    It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

    1. Get an open source pdf decoder

    2. Decode bytes up to first ambiguous char

    3. See if the next bits are valid with a 1; if not, it’s an l

    4. Might need to backtrack if both 1 and l were valid

    By being able to quickly try each char in the middle of the decoding process you cut out the start time. This makes it feasible to test all permutations automatically and linearly
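The steps above can be sketched roughly as follows. This is a minimal illustration, not the commenter's code: it assumes the only OCR confusion is between 1 and l, and it swaps the PDF decoder for a plain zlib stream as the validity oracle (names like `CONFUSABLE` and `recover` are made up for the sketch). The text is consumed one 4-character base64 quantum at a time, and a copy of the decompressor state is kept so a wrong guess can be backtracked cheaply.

```python
import base64
import itertools
import zlib
from typing import Optional

# Assumption for this sketch: OCR only ever confuses these two characters.
CONFUSABLE = {"1": "1l", "l": "1l"}


def recover(b64_text: str) -> Optional[bytes]:
    """Resolve 1/l ambiguity by depth-first search with backtracking."""

    def search(pos: int, decomp, out: bytes) -> Optional[bytes]:
        if pos >= len(b64_text):
            # Accept only a complete stream with nothing left over; zlib
            # reports eof only after its Adler-32 trailer has checked out.
            return out if decomp.eof and not decomp.unused_data else None
        if decomp.eof:
            return None  # stream ended but text remains: wrong branch
        quantum = b64_text[pos:pos + 4]
        options = [CONFUSABLE.get(c, c) for c in quantum]
        for candidate in itertools.product(*options):
            raw = base64.b64decode("".join(candidate))
            trial = decomp.copy()  # snapshot of decoder state for backtracking
            try:
                chunk = trial.decompress(raw)
            except zlib.error:
                continue  # invalid continuation: prune this branch
            result = search(pos + 4, trial, out + chunk)
            if result is not None:
                return result
        return None

    return search(0, zlib.decompressobj(), b"")


# Simulate the damage: every l in the base64 text is read as 1.
original = b"Content-Type: application/pdf; name=attachment.pdf\n" * 3
clean = base64.b64encode(zlib.compress(original)).decode("ascii")
damaged = clean.replace("l", "1")
recovered = recover(damaged)
```

A real PDF is not one zlib stream, so an actual harness would need the same kind of snapshot-and-resume hook inside a PDF parser; the sketch only shows the search shape.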

    • bawolff 3 hours ago
      Sounds like a job for afl
  • bushbaba 1 hour ago
    This proves my paranoia that you should print and rescan redactions. That, or take screenshots of the redacted PDF and convert them back to a PDF.
  • percentcer 4 hours ago
    This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.
    • jjwiseman 2 hours ago
      Or one person types 76 pages. This is a thing people used to do, not all that infrequently. Or maybe you have one friend who will help–cool, you just cut the time in half.
      • wildzzz 43 minutes ago
        Typing 76 pages is easy when it's words in a language you understand. WPM is going to be incredibly slow when you actually have to read every character. On top of that, no spaces and no spellcheck so hopefully you didn't miss a character.
        • ryanSrich 19 minutes ago
          Seems like a job for an LLM
    • WolfeReader 4 hours ago
      You think compelling 76 people to honestly and accurately transcribe files is something that's easy and quick to accomplish.
    • fragmede 4 hours ago
      > Just get 76 people

      I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.

      • Krutonium 3 hours ago
        Amazon Mechanical Turk?
  • ChocMontePy 1 hour ago
    You can use the justice.gov search box to find several different copies of that same email.

    The copy linked in the post:

    https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

    Three more copies:

    https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

    https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

    https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

    Perhaps having several different versions might make it easier.
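One way several copies could help is a per-character majority vote across independent OCR passes. The sketch below assumes the copies have already been aligned to the same length (which real OCR output may not be) and that errors in different copies are independent; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import List


def majority_vote(copies: List[str]) -> str:
    """Pick the most common character at each position across the copies."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be aligned to the same length")
    # zip(*copies) yields one tuple per column (character position).
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


# Three hypothetical OCR readings of the same base64 quantum.
voted = majority_vote(["dGhl", "dGh1", "dGhl"])
```

With only two copies a disagreement still flags which positions are uncertain, even though it cannot settle them.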

  • pimlottc 4 hours ago
    Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

    Hmm. Anyone got some spare CPU time?
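To put the estimate above in numbers, here is a quick check using only the figures from the comment (76 pages, about 69 lines per page, one ambiguous character per line). The `sequential` figure is an extra assumption for comparison: it is the cost if each character could somehow be validated on its own, with two tries each.

```python
pages = 76
lines_per_page = 69

ambiguous = pages * lines_per_page   # one [1l] per line, as assumed above
exhaustive = 2 ** ambiguous          # try every combination
digits = len(str(exhaustive))        # size of that number in decimal digits
sequential = 2 * ambiguous           # two tries per character, one at a time

print(f"{ambiguous} ambiguous characters")
print(f"exhaustive search: a {digits}-digit number of candidates")
print(f"sequential search: at most {sequential} checks")
```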

    • wahern 4 hours ago
      It should be much easier than that. You should be able to serially test if each edit decodes to a sane PDF structure, reducing the cost similar to how you can crack passwords when the server doesn't use a constant-time memcmp. Are PDFs typically compressed by default? If so, that makes it even easier given built-in checksums. But it's just not something you can do by throwing data at existing tools. You'll need to build a testing harness with instrumentation deep in the bowels of the decoders. This kind of work is the polar opposite of what AI code generators or naive scripting can accomplish.
      • cluckindan 3 hours ago
        On the contrary, that kind of one-off tooling seems a great fit for AI. Just specify the desired inputs, outputs and behavior as accurately as possible.
      • pimlottc 2 hours ago
        I wonder if you could leverage some of the fuzzing frameworks tools like Jepsen rely on. I’m sure there’s got to be one for PDF generation.
  • kevin_thibedeau 3 hours ago
    pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

    Followup: pdfimages is 13x faster than pdftoppm
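A small wrapper for the direct-extraction route might look like this. It assumes Poppler's pdfimages is on the PATH; the file names in the usage comment are placeholders, and the function names are made up for the sketch.

```python
import subprocess
from typing import List


def pdfimages_command(pdf_path: str, out_prefix: str) -> List[str]:
    """Build the pdfimages call; -all keeps each image in its native format."""
    return ["pdfimages", "-all", pdf_path, out_prefix]


def extract_images(pdf_path: str, out_prefix: str) -> None:
    """Write the embedded images next to out_prefix without re-rasterizing."""
    subprocess.run(pdfimages_command(pdf_path, out_prefix), check=True)


# Usage (placeholder names): extract_images("scan.pdf", "page")
```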

  • velaia 3 hours ago
    Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle
  • Evidlo 56 minutes ago
    I took a stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.
  • FarmerPotato 4 hours ago
    If only Base64 had used a checksum.
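To illustrate the point: Base64 itself carries no integrity check, so a misread character still decodes cleanly, just to the wrong bytes. The sketch below shows that, then shows a hypothetical per-line CRC-32 suffix (not part of any Base64 standard; `tag_line` and `line_is_intact` are invented names) that would have flagged the damaged line.

```python
import base64
import zlib


def tag_line(line: str) -> str:
    """Append a CRC-32 of the line as 8 hex digits (hypothetical format)."""
    return f"{line} {zlib.crc32(line.encode('ascii')):08x}"


def line_is_intact(tagged: str) -> bool:
    """Recompute the CRC-32 and compare it with the suffix."""
    line, crc = tagged.rsplit(" ", 1)
    return f"{zlib.crc32(line.encode('ascii')):08x}" == crc


good = "dGhl"   # base64 of b"the"
bad = "dGh1"    # same text with the final l misread as 1

decoded_good = base64.b64decode(good)   # b"the"
decoded_bad = base64.b64decode(bad)     # b"thu": no error raised

tagged = tag_line(good)
damaged = bad + " " + tagged.rsplit(" ", 1)[1]   # damage the data, keep the CRC
```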
    • zahlman 4 hours ago
      "had used"? Base64 is still in very common use, specifically embedded within JSON and in "data URLs" on the Web.
      • bahmboo 3 hours ago
        "had" in the sense of when it was designed and introduced as a standard
  • nubg 2 hours ago
    Wait would this give us the unredacted PDFs?
    • ryanSrich 17 minutes ago
      That's the idea yeah. There are other people actively working on this. You can follow vx-underground on twitter. They're tracking it.
    • poyu 2 hours ago
      I think it's the PDF files that were attached to the emails, since they're base64 encoded.
  • queenkjuul 40 minutes ago
    I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s
  • legitster 3 hours ago
    Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

    Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

    • Spooky23 2 hours ago
      You’re thinking about this as a nerd.

      It’s not a tools problem, it’s a problem of malicious compliance and contempt for the law.

    • derwiki 3 hours ago
      JPEG?
      • legitster 2 hours ago
        That's not really comparable - It needs to be editable and searchable.
      • recursive 53 minutes ago
        Lossy
  • blindriver 3 hours ago
    On one hand, the DOJ gets shit because it was taking too long to produce the documents, and then on another, they get shit because there are mistakes in the redacting because there are 3 million pages of documents.
    • rapind 1 hour ago
      What they are redacting is pretty questionable though. Entire pages being suspiciously redacted with no explanation (which they are supposed to provide). This is just my opinion, but I think it's pretty hard to defend them as making an honest and best effort here. Remember they all lied about and changed their story on the Epstein "files" several times now (by all I mean Bondi, Patel, Bongino, and Trump).

      It's really really hard to give them the benefit of the doubt at this point.

    • thereisnospork 2 hours ago
      Considering the justice to document ratio that's kind of on them regardless.
  • linuxguy2 4 hours ago
    Love this, absolutely looking forward to some results.
  • eek2121 4 hours ago
    Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

    Cool article, however.

  • iwontberude 4 hours ago
    This one is irresistible to play with. Indeed a nerd snipe.
    • netsharc 4 hours ago
      I doubt the PDF would be very interesting. There are enough clues in the human-readable parts: it's an invite to a benefit event in New York (filename calls it DBC12) that's scheduled on December 10, 2012, 8pm... Good old-fashioned searching could probably uncover what DBC12 was, although maybe not, it probably wasn't a public event.

      The recipient is also named in there...

      • RajT88 4 hours ago
        There's potentially a lot of files attached and printed out in this fashion.

        The search on the DOJ website (which we shouldn't trust), given the query: "Content-Type: application/pdf; name=", yields maybe a half dozen or so similarly printed BASE64 attachments.

        There's probably lots of images as well attached in the same way (probably mostly junk). I deleted all my archived copies recently once I learned about how not-quite-redacted they were. I will leave that exercise to someone else.
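For context on what those printed attachments look like once the text is clean, here is a minimal sketch that pulls a base64 PDF part out of a MIME message with Python's standard email module. The message below is a made-up example (the boundary, body text, and filename invite.pdf are all hypothetical); the payload is just the 9-byte PDF header line.

```python
from email import message_from_string

raw_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attached.\n"
    "--XYZ\n"
    'Content-Type: application/pdf; name="invite.pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "JVBERi0xLjQK\n"
    "--XYZ--\n"
)

attachments = {}
for part in message_from_string(raw_message).walk():
    if part.get_content_type() == "application/pdf":
        # decode=True undoes the base64 transfer encoding.
        attachments[part.get_filename()] = part.get_payload(decode=True)
```

The hard part in this thread is upstream of this step: the parser needs exact base64 text, which is what the 1/l confusion breaks.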

  • zahlman 4 hours ago
    > …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

    A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.

    • yunnpp 1 hour ago
      Time to flex those Leetcode skills.