Large-Scale Online Deanonymization with LLMs

(simonlermen.substack.com)

160 points | by DalasNoin 1 day ago

36 comments

danielodievich 4 hours ago
I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it. I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration. I don't know what it will be but I would expect some adversarial stuff. Trying to keep clean is what I'd prefer for myself and my kids.
On the other hand, Neal Stephenson's Fall; or, Dodge in Hell has an interesting idea early in the book, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is to spam like crazy with anything so nobody can make out the core. It is cleverly explored in the book, albeit for too short a time before it moves into virtual reality. I think there are a few people out there right now practicing this.
[-]
- DrewADesign 4 hours ago
  > I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.
  I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.
  [-]
  - monksy 3 hours ago
    I think he's wrong and I'm willing to say that. People's difficulty moving beyond the fundamental attribution error is well known, and correcting it takes major resources. If you assume anything you post can easily be attributed to you later, you must future-proof your words. That is not possible and it is extremely suppressive to express yourself.
    For example: "Ellen Page is fantastic in the Umbrella Academy TV show". Innocent, accurate, supportive, and positive in 2019.
    Same comment read after 1 Dec 2020 (Transition coming out): Insensitive, demeaning, inaccurate.
    [-]
    - JohnMakin 3 hours ago
      > That is not possible and it is extremely suppressive to express yourself.
      Also for the fact that you cannot predict how future powers will view past comments - for instance, certain benign political views 20 years ago could become "terroristic speech" tomorrow.
      I operate by a simple, general rule - I don't often say anything online I wouldn't say directly to someone's face in real life.
      [-]
      - NetOpWibby 2 hours ago
        > I operate by a simple, general rule - I don't often say anything online I wouldn't say directly to someone's face in real life.
        More people should keep this same energy. I try to stress this to my kids and it feels like it's falling on deaf ears in regards to my teen. Alas.
        [-]
        JohnMakin 1 hour ago
        I can be a rude prick online sometimes, but I can be in real life too - basically though the reason I do this is I never want it to be some huge surprise IRL if someone sees what I write online and be like, "wow, I didn't know that about him." I'm pretty much what I am online and IRL the same. For some reason this seems to matter for me, at least in the past when people have tried to like, send employers stuff I may have written online. The reaction is like "oh, yea, we knew that already about him."
        Nothing terrible, maybe slightly embarrassing, but you know how online spaces can be. just be yourself basically, at least I try to be.
        [-]
        albumen 12 minutes ago
        Your framing is interesting. You may feel that you can’t change who you are in real life, but people have a choice on how they behave online (or choose not to engage at all). So you could choose to be nice (or at least not a jerk); I’m pretty sure you wouldn’t get people writing to your employer complaining. I’d argue that if you know you’re sometimes a jerk, it’d be less stressful for you and others if you didn’t bring that energy online.
      - actionfromafar 2 hours ago
        Interesting. You could probably get into trouble in those two places for extremely different things you said.
        [-]
        JohnMakin 2 hours ago
        of course, and it has happened, but I think authenticity is usually appreciated
        NooneAtAll3 1 hour ago
        what two places?
    - DrewADesign 3 hours ago
      I think it’s naive to assume the private companies selling these services will know, let alone care, let alone disclose when their black box models botch things like this. The companies currently purporting to provide this exact service to HR departments for hiring decisions clearly didn’t let that stop them.
    - antonvs 3 hours ago
      > Same comment read after 1 Dec 2020 (Transition coming out): Insensitive, demeaning, in accurate.
      I genuinely don't understand this. Are you sure you're not imagining possible offenses against some non-existent standard?
      [-]
      - we_have_options 3 hours ago
        well, how about "abortion legal" to "abortion murder"... possible to see this coming, but I know doctors in NY who are now afraid to travel to Texas.
        How about DEI initiatives as good things in 2024 and a mark of evil in 2025? Lots of people were fired because in 2024 their boss told them to work on DEI and they did what their boss told them to do. Turns out this was a capital offense.
      - anjel 3 hours ago
        standards change over time. Grandfather clauses are a courtesy, not a right.
        [-]
        heisenbit 2 hours ago
        Society's legal double standard:
        - people can create new standards that will be applied retroactively
        - lawmakers can create new laws, which cannot be applied retroactively
  - Nevermark 3 hours ago
    That we identify social media as "tech" is very strange.
    Yes, they have a lot of servers. But that isn't their core innovation. Their core innovations are the constant expansion of unpermissioned surveillance, the integration of dossiers, correlating people's circumstances, behavior and psychology. And incentivizing the creation of addictive content (good, bad, and dreck) with the massive profits they obtain when they can use that as the delivery vector for intrusively "personalized" manipulation, at the behest of the highest bidder, no matter how sketchy, grifty or dishonest.
    Unpermissioned (or dark patterned, deceptive, surreptitious, or coercive permissioned) surveillance should be illegal. It is digital stalking. Used as leverage against us, and to manipulate us, via major systems spread across the internet.
    And the fact that this funds infinite pages of addicting (as an extremely convenient substitute for boredom) content, not doing anyone or society any good, is a mental health, and society health concern.
    Tech scaling up conflicts of interest is not really tech. It's personal information warfare.
    [-]
    - DrewADesign 2 hours ago
      I didn’t say I hated technology, generally— I said I hate what the industry has morphed into in the US. What is or isn’t tech is immaterial. All of the odious things you listed are things that the ‘tech industry’ does, largely unquestioned, these days. Frankly, it’s sickening.
- tclancy 1 hour ago
  I have lived my life on the web under the assumption the other Tom Clancy will leave enough chaff in my wake to make things hard. But probably not because I make the same 5 or 6 jokes over and over.
- sponaugle 3 hours ago
  I am similar in that all of my interactions are with my real name and it is unique enough that just putting it into google will instantly identify me. There is one other 'jeff sponaugle' but I think he is far more annoyed with my presence than I would be with him.
  On the plus side, someone will sometimes say while talking to me - oh you are that Subaru guy, or that youtube guy, or whatever and that is a fun connection.
- gambutin 1 hour ago
  How would "flooding the zone" actually work in that case?
  AFAIK the strategy is usually used to divert attention from one subject that could be harmful to a person to some other stuff.
  Wouldn’t spamming in that case provide more information about you?
  [-]
  - croes 12 minutes ago
    If in one post you say you’re Jewish, in the next you are Christian, in the next your Hindu, in the next youre Atheist it’s harder to know what your really are.
    You could even mislead people if you know the difference between your and you‘re.
- qsera 4 hours ago
  > as clean of a footprint on the internet
  The only winning move here is not to play.
- pavel_lishin 4 hours ago
  That whole book seemed like a collection of interesting threads that ultimately go nowhere.
  I honestly don't even think I understood the ending. Or the middle, if I'm being extra honest.
  I think Anathem addressed the "flood the zone with shit" much better in something like three paragraphs.
- slopinthebag 3 hours ago
  I think as the younger generations come of age they simply will not care about that sort of thing. Like it or not, it's part of the culture and might just be accepted as the norm.
  [-]
  - SchemaLoad 36 minutes ago
    I think it's kind of happened already. All the time we see news of politicians or famous people having their very old photos, comments, or reddit accounts found with distasteful takes. And it seems they can mostly just handwave it away with "Hey that was 10 years ago and I wouldn't make those comments today" and nothing seems to come of it.
  - croes 9 minutes ago
    When the younger generation comes of age the new younger generation will have a different culture and norm what is acceptable.
    People got in trouble for things they posted years ago where they didn‘t care but others did
  - AlecSchueler 1 hour ago
    They might not care about it themselves but what about their government?
  - MengerSponge 2 hours ago
    Vonnegut's Amphibians from "Unready to Wear"
- croes 19 minutes ago
  > I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.
  You don’t know what information about you can get you into trouble in the future.
- godelski 3 hours ago
  While I think the strategy is effective it is also likely equivalent to the dark forest. To me that's a case of the cure being worse than the poison.
- observationist 3 hours ago
  Autonomous Proxies for Execration - spam bots whose entire purpose is flooding the internet with spam so as to make identifying anything true utterly impossible. If you can't differentiate between real and unreal information in online comments, then online comments stop being a significant factor in shaping public opinion. You need to abstract - identify reliable sources of information, individuals or institutions that do the work to collect and curate.
  We're already seeing this as a side effect of the mishmash of influence operations on social media - with so many competing interests, mixed in with real trolls, outrage farmers, grifters, and the like, you literally cannot tell without extensive reputation vetting whether or not a source is legitimate. Even then, any suggestion that an account might be hacked or compromised, like a significant sudden deviation in style or tone or subject matter, you have to balance everything against a solid model of what's actually behind probably 80% or more of the "user" posts online.
  There are a lot of aligned interests causing APEs to manifest - they're a mix of psyop style influence campaigns, some aimed at demoralization, others at outrage engagement, others at smears and astroturfing and even doing product placement and subtle advertisement. The net effect is chaos, so they might as well be APEs.
- ectospheno 4 hours ago
  I expect more people over time to use local LLMs to write every single post they make online.
  [-]
  - shitloadofbooks 1 hour ago
    At this point, where everyone is using an LLM to post and I'm having to use an LLM to keep up and summarise it, I think I'll just ...stop and go outside for quite a while...
  - tlavoie 3 hours ago
    At that point, why bother to make any posts at all?
  - pbhjpbhj 3 hours ago
    >post they make
    Will they realise their life has devolved to pretending an LLM is them and watching whilst the LLM interfaces {I was going to say 'interacts', but this fits!} with other bots?
    Will they then go outside whilst 'their' bot "owns the libs" or whatever?
    Hopefully at some point there is a Damascus road awakening.
  - goatlover 2 hours ago
    What would that accomplish? Just to keep their social credit score in the acceptable range while they go touch grass?
- KPGv2 3 hours ago
  Fifteen years or so ago I read an article arguing that by the time Millennials are nearing retirement and have more political power, people will give less of a shit about what you did online in your twenties because we will have, out of necessity, learned that asshattery in your twenties is largely irrelevant to your trustworthiness in your sixties.
  When I was that age, you could tell the kids who had political ambitions self-censored online. But now everyone is buck wild so you have to ignore that when looking at people.
  For example, a MASSIVE portion of Millennials and younger looking at the Maine election are pretty chill about the leading Democratic candidate having a Nazi tattoo because of this very thing. Basically, "dumb, drunk, deployed Marines will get cool skull and crossbones tattoos in their early twenties, and so what if he said a couple ill-worded somewhat misogynistic things in his twenties, that was decades ago, and he's obviously a different person."
  Contrast with Bill Clinton, where he literally had to explain away university marijuana usage TWENTY YEARS AFTER THE FACT.
  Point is, I think we're witnessing this evolution happening right now.
  [-]
  - AtlasBarfed 1 hour ago
    This isn't the dystopia we're worried about.
    The dystopia we're worried about is a 1984 on steroids with LLMs and real 24/7 worldwide monitoring by the state.
    Getting caught doing embarrassing things by teenage social standards doesn't threaten your life.
    A competent version of Donald Trump could have walked into the office and we would have been worse than the Third Reich.
    Still could be today, right now. The capability is turnkey right now at the US government.
    This is open research being discussed here. Palantir already has all of this and probably 10 times more.
thatguysaguy 7 minutes ago
Maybe I missed something, but I see little evidence that there is a concerning ability to deanonymize. Many people post under a pseudonym but then link to their GitHub etc. In fact by construction the HN dataset _only_ consists of people who are comfortable with their real identity being linked to it.
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
john_strinlai 5 hours ago
many people tend to overlook how little information is needed for successful de-anonymization.
i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was nearly 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in various technology that enhances/enables various techniques.
i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and im not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
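To make "only a little bit" concrete, here is a minimal sketch loosely modeled on the idea in the quoted paper: score every record by how many of the adversary's known items it shares, weight rare items more heavily, and accept the top match only if it clearly beats the runner-up. This is a simplification under stated assumptions, not the paper's algorithm (which also uses ratings and dates and a different acceptance test); the record names, the weighting formula, and the `min_gap` threshold are all hypothetical.

```python
import math
from collections import Counter


def rarity_weights(records):
    """Weight each item by 1 / log(1 + number of records containing it)."""
    support = Counter(item for items in records.values() for item in set(items))
    return {item: 1.0 / math.log(1 + n) for item, n in support.items()}


def best_match(aux, records, min_gap=0.5):
    """Return the record id that best matches `aux`, or None if ambiguous.

    Assumes `records` holds at least two entries. `min_gap` is an
    illustrative threshold, not a value taken from the paper.
    """
    weights = rarity_weights(records)
    known = set(aux)
    # Score = total rarity weight of the items shared with the adversary's knowledge.
    scores = {
        rid: sum(weights[item] for item in set(items) & known)
        for rid, items in records.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if scores[top] - scores[runner_up] < min_gap:
        return None  # no candidate stands out, so refuse to guess
    return top


# Toy "anonymized" dataset (made-up titles and ids).
records = {
    "user_a": ["Popular Movie", "Obscure Film 1", "Obscure Film 2"],
    "user_b": ["Popular Movie", "Other Hit"],
    "user_c": ["Popular Movie", "Other Hit", "Obscure Film 3"],
}

# One rare title is enough to single out a record; a popular one is not.
print(best_match(["Popular Movie", "Obscure Film 1"], records))
print(best_match(["Popular Movie"], records))
```

In this toy data the first query singles out `user_a` because "Obscure Film 1" appears in only one record, while the second query matches everyone equally and is rejected.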
[-]
- DalasNoin 5 hours ago
  That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see if LLMs are better than Narayanan purely on movie reviews. LLMs are still much better (getting about 8%, but the average person only had 2.5 movies and 48% only shared one movie, so very difficult to match)
  [-]
  - john_strinlai 5 hours ago
    >we make a pretty direct comparison in section 5
    awesome, i saw the mention in the introduction but i havent yet had a chance for a thorough read through of the paper -- ive just skimmed it. looking forward to reading it in-depth!
- Jerrrrrrrry 4 hours ago
  Throwaway accounts using "clever" turns of phrase can often be deanonymized by double-clicking, right-clicking -> googling their witty pun and seeing the sole other instance elsewhere, on Twitter, Facebook, etc.
  If I see a couple words I don't know in a row, I can infer a poster's real name.
  I'd be more specific but any example is doxxing, literally so
  [-]
  - SchemaLoad 33 minutes ago
    If you have access to the whole site dataset it's much more reliable with simpler checks. You can just use word usage frequency of common words. Someone posted a demo here of doing this to HN comments which was very effective at showing alt accounts for a user.
  - plagiarist 1 hour ago
    I assume one's vocabulary is basically a fingerprint, even if one doesn't use unique turns of phrase. Domain knowledge just leaks in and we aren't conscious of it being identifiable.
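The common-word frequency check SchemaLoad mentions can be sketched in a few lines: build a vector of how often each account uses a fixed list of function words, then rank candidate accounts by cosine similarity to the target. This is only an illustration of the general technique, not the method used in the HN demo or in the paper; the word list, the function names, and the sample texts are assumptions.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical list of common function words; real systems use far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "but", "i", "just", "so", "which"]


def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(target_text, candidates):
    """Return (similarity, account) pairs, most similar first."""
    target = profile(target_text)
    scored = [(cosine(target, profile(text)), name) for name, text in candidates.items()]
    return sorted(scored, reverse=True)


target = "I just think that it is so simple, but the point is that it works."
candidates = {
    "alt_account": "It is just that I think the fix is so obvious, but it is ignored.",
    "stranger": "Of the options in the catalogue, a selection of which is listed.",
}
for score, name in rank_candidates(target, candidates):
    print(f"{name}: {score:.2f}")
```

With only a sentence of text per account the scores are noisy; the approach depends on having a large body of comments per user, which is why access to the whole site dataset matters.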
notepad0x90 14 minutes ago
Even without LLMs this was possible.
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government, as a result. But I will continue to criticize them, I just wish it wasn't so easy for them to retaliate.
iamnothere 2 hours ago
Despite being pseudonymous, I don’t take great pains to hide who I am. I am in my 50s and live on the West coast. I don’t have socials and I don’t post anywhere else. Have at it!
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there are limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
[-]
- angry_octet 1 hour ago
  Unless you're in the nebulous situation of being Hispanic in the US, in which case you might get profiled. Or you might have family with jobs that are subject to pressure -- and right now, that seems like most jobs, because calling employers spineless is an insult to worms. Or if you'd like to travel by air, because watchlists are back, and carriers may just refuse service.
  [-]
  - iamnothere 11 minutes ago
    Fair enough. I am in a category that’s typically lower risk (though not zero) for profiling, so sometimes I forget that. Still, the potential risk isn’t a good reason to silence your voice if there are issues that you find important. The best defense is to avoid giving out personal details and avoid discussion on non-pseudonymous social sites.
ghm2199 48 minutes ago
I want to use "slower" methods of identification more. Like say for instance within a few blocks of you a human can identify who you are for any service that wants to do some kind of verification/proof you are/have XYZ.
We could designate specific individuals to do this for you and me, just like we do today with trust authorities for website certificates.
No more verified profiles by uploading names, emails and passports and photographs (gosh!). Just turned 18 and want to access insta? Go to the local high school teacher to get age verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically with no PII exchanged between the person, the community or the webservice. And you can be anonymous yet people know you are real.
It can be all maintained on a tree of trust, every individual in the chain needs to be verified, and only designated individuals can do actions that are sensitive/important.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
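The tree of trust described above can be sketched as a chain check: a local attestor vouches for a claim about a random pseudonym, and the verifier walks from that attestor up to a trusted root. This is a toy under loud assumptions: HMAC tags stand in for real digital signatures (so this verifier has to hold every key, which a real deployment would avoid with public-key signatures), and all names, keys, and the claim format are hypothetical. It also does nothing to stop the attestor from linking the pseudonym to the person, which would need something like blind signatures.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> str:
    """Stand-in for a digital signature: an HMAC-SHA256 tag (assumption)."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Hypothetical attestors and the parent that vouched for each of them.
KEYS = {"root": b"root-secret", "school": b"school-secret", "teacher": b"teacher-secret"}
PARENT = {"school": "root", "teacher": "school"}

# Each non-root attestor holds a tag from its parent over its own name.
VOUCHERS = {child: sign(KEYS[parent], child.encode()) for child, parent in PARENT.items()}


def issue_claim(attestor: str, pseudonym: str, claim: str) -> dict:
    """The attestor tags (pseudonym, claim); no name, email, or ID is included."""
    message = f"{pseudonym}|{claim}".encode()
    return {"attestor": attestor, "pseudonym": pseudonym, "claim": claim,
            "tag": sign(KEYS[attestor], message)}


def verify_claim(att: dict, trusted_root: str = "root") -> bool:
    """Check the claim's tag, then every voucher on the path up to the root."""
    node = att["attestor"]
    if node not in KEYS:
        return False
    message = f"{att['pseudonym']}|{att['claim']}".encode()
    if not hmac.compare_digest(att["tag"], sign(KEYS[node], message)):
        return False
    while node != trusted_root:
        parent = PARENT.get(node)
        if parent is None:
            return False  # chain does not reach the trusted root
        if not hmac.compare_digest(VOUCHERS[node], sign(KEYS[parent], node.encode())):
            return False
        node = parent
    return True


pseudonym = secrets.token_hex(8)  # random handle, unrelated to any real identity
attestation = issue_claim("teacher", pseudonym, "age>=18")
print(verify_claim(attestation))

forged = dict(attestation, claim="journalist")  # claim changed after signing
print(verify_claim(forged))
```

The web service only ever sees the pseudonym, the claim, and the chain of tags, which is the property the comment is after.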
kseniamorph 5 hours ago
I'm not sure the practical implications are as dramatic as the paper suggests. Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods. The people most at risk from this are probably activists and whistleblowers in jurisdictions where those direct methods aren't available, not average users.
[-]
- gwern 4 hours ago
  Attacks can be chained, and this can all be automated. For example, imagine pigbutchering scams... except it's there, similar to some voice-cloning scams, just to get enough data to stylometrically fingerprint you for future reference. You make sure to never comment too much or spicily under your real name, but someone slides into your DMs with a thoughtful, informative, high-quality comment, and you politely strike up an interesting conversation which goes well and you think nothing of it and have forgotten it a week later - and 5 years later you're in jail or fired or have been doxed or been framed. 'Direct methods' can't deliver that kind of capability post hoc, even for actors who do have access to those methods (which is a vanishing percentage of all actors). No one has cheap enough intelligence and skilled labor to do this right now. But they will.
- GorbachevyChase 4 hours ago
  I actually think those most at risk are normal people the activists will harass. Soon it will be possible for anybody who works at the “wrong” business or expresses any opinion on any subject to be casus belli for unhinged, terminally online, mentally ill people who are mad about the thing of the day to start making threatening calls to your employer or making false reports to police or sending deep fake porn to your mom.
  I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.
- ceejayoz 5 hours ago
  > Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods.
  Easier methods probably means more adversaries.
  [-]
  - gmuslera 5 hours ago
    And different agendas. Governments and corporations don't try social engineering attacks or scams, or do things that end in e.g. ransomware attacks.
    [-]
    - 5o1ecist 4 hours ago
      - The U.S. NSA ran fake LinkedIn and Facebook profiles to phish foreign targets, as revealed in Snowden leaks, posing as recruiters to install malware.
      - UK's GCHQ conducted "Operation Socialist," using false personas on social media for spear-phishing against telecom firms worldwide.
      - In 2016, Russian GRU operatives (targeting Western elections) used spear-phishing on Democratic Party emails, but U.S. agencies mirrored similar tactics in counter-ops per declassified reports.
      - "A Diamond is Forever".
      Emotional manipulation linking diamonds to eternal love; planted stories, lobbied celebrities; created artificial scarcity myth despite stockpile.
      - Amazon, Walmart, etc.
      Scarcity/urgency prompts ("only 2 left!"); personalized "recommended for you" via data exploits.
      - Fake reviews.
      Paid influencers posed as riders praising service; hidden surge pricing mind games.
      - "Torches of freedom".
      Women-only events handing cigarettes as "freedom symbols" to subvert norms.
      Feel free to ask for more:
      https://www.perplexity.ai/search/hey-someone-on-hackernews-c...
      [-]
      - iamnothere 2 hours ago
        Don’t forget eBay: https://www.wired.com/story/ebay-employees-charged-cyberstal...
    - tosapple 41 minutes ago
      [dead]
- graemep 5 hours ago
  I can imagine a lot of countries who want to control what their citizens say abroad. I know Iraq in Saddam Hussein's time did it in the UK, China does it now.
- intended 5 hours ago
  People who comment about their boss and workplaces?
  People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)
  Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.
  On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.
  [-]
  - john_strinlai 5 hours ago
    another big one: people looking for insurance, or looking to claim insurance
- afpx 5 hours ago
  deanonymizing the people who deanonymize people at scale
prats226 18 minutes ago
If you can deanonymize at scale with LLMs, then on a personal level you should also be able to figure out which posts are leading to this deanonymization and remove or modify them.
bigwheels 3 hours ago
A related past submission comes to mind:
Show HN: Using stylometry to find HN users with alternate accounts
https://news.ycombinator.com/item?id=33755016 - Nov 2022, 519 comments
JohnMakin 5 hours ago
As people will point out, the OSINT techniques described are nothing new - typically, in the past, you could de-anonymize based on writing style or niche topics/interests. Total deanonymization can occur if any of these accounts link to profiles containing pictures of their faces, which can then be web-searched to link to a real identity. It's astounding how many people re-use handles on stuff like porn sites linked very easily to their IRL identity.
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that what would take a human investigator a bit of time, even using common OSINT tooling, will become trivial.
You should never assume you have total anonymity on the open web.
[-]
- ghywertelling 4 hours ago
  If LLMs can identify a person across websites, I can ask an LLM to read his posts and write like him, impersonating him, and then this feeds back into the tools identifying him. I can probabilistically malign a person this way.
  [-]
  - JohnMakin 4 hours ago
    This already is a thing people did at least as far back as I started getting into web privacy, which was ~10 years ago. I have been the target of it before.
    LLMs are probably better at it, but I don't know if this is as destructive as people may guess it would be. Probably highly person-dependent.
    The micro-signals this paper discusses are more difficult to fake.
  - john_strinlai 4 hours ago
    stylometry is only one aspect of de-anonymization. what you describe is certainly a threat that we will have to deal with, but there is a lot more to credible impersonation than just being able to mimic a writing style
  - functionmouse 4 hours ago
    So this means deanonymization doesn't work? Rejoice?
  - Jerrrrrrrry 4 hours ago
    How to conduct a psy-op
    https://youtu.be/YTGQXVmrc6g
- warkdarrior 5 hours ago
  I think the implication is this will become trivial and trivially automated, no human investigator needed. I bet there will be plugins in one year's time to right click on a post and get a full report on who the author is.
  [-]
  - JohnMakin 4 hours ago
    agreed and the new frontier here will probably be obfuscation by creating false positives with these same tools, but that kind of renders the web unusable in my mind.
    [-]
    - arctic-true 4 hours ago
      I had this same thought. Seems fairly easy to just put off a strong false signal. If you don’t want anyone to know that you live in Finland, make a point to constantly mention how much you enjoy living in Peru.
  - 0xdeadbeefbabe 4 hours ago
    Wouldn't it also become trivial to pretend to be another author?
    [-]
    - john_strinlai 4 hours ago
      it may become more trivial to llm your comments/blog/whatever into a different "voice", but there is so much that can be used for de-anonymization that the llm-assisted technique doesn't address.
      for example, you may change the content of your comments, but if you only ever comment on the same topic, the topic itself is a signal. when you post (both day and time), frequency of posts, topics of interest, usernames (e.g. themes or patterns), and much more.
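To illustrate why rewriting the text does not hide when you post, here is a minimal sketch comparing two accounts by their hour-of-day posting histograms. It covers only one of the signals listed above, and it is an assumption-laden toy: timestamps are taken as Unix seconds in UTC, the overlap measure is one arbitrary choice among many, and the sample accounts are made up.

```python
from datetime import datetime, timezone


def hour_histogram(timestamps):
    """Fraction of posts made in each UTC hour (24 buckets)."""
    hist = [0.0] * 24
    for ts in timestamps:
        hist[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    total = sum(hist) or 1  # avoid dividing by zero for empty accounts
    return [count / total for count in hist]


def overlap(p, q):
    """Histogram intersection: 1.0 for identical habits, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))


HOUR = 3600
DAY = 24 * HOUR

# Hypothetical accounts: A and B post in the evening (UTC), C posts before dawn.
account_a = [h * HOUR for h in (20, 21, 21, 22)]
account_b = [DAY + h * HOUR for h in (21, 22, 22)]
account_c = [h * HOUR for h in (3, 4, 5)]

hist_a = hour_histogram(account_a)
print(overlap(hist_a, hour_histogram(account_b)) > overlap(hist_a, hour_histogram(account_c)))
```

On its own this is weak evidence, since many people share a time zone, but combined with topic and frequency signals it narrows the candidate set regardless of what the comments say.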
block_dagger 4 hours ago
Does this mean we'll find out who Satoshi is with a high degree of confidence?
[-]
- hellojesus 3 hours ago
  Clearly the CIA or another government institution. Its purpose is to create an irresistible honeypot so that anyone who figures out a working, time-feasible implementation of Shor's algorithm or another prime factorization technique would reveal their hand.
econ 1 hour ago
Everyone should really stop posting online unless their job requires it.
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless obnoxious shadows of their helpful and loving self.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
deadbabe 24 minutes ago
Doesn’t all this deanonymization stuff depend on one fatal assumption: that people are actually being truthful with what they say about themselves?
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
yomismoaqui 5 hours ago
I did something like this: I passed in some of my comments from here and then prompted Gemini to identify my native language by reading my not-so-good English.
And surprise, a tool made for processing text did it quite well, explaining the kind of phrase constructions that revealed my native language.
So maybe this is a plus for passing any text published on the internet through a slopifier for anonymization?
EDIT: deanonymization -> anonymization
[-]
- joe_mamba 5 hours ago
  >So maybe this is a plus for passing any text published on the internet through a slopifier for deanonymization?
  Or vice versa, Indian scammers online can now run their traditional Victorian English phrasing through an AI to sound more authentically American.
  Interviewers now have to deal with remote North Korean deepfaked candidates pretending to be Americans.
  Just like the internet, AI is now a force multiplier for scammers and bad actors of all sorts, not just for the good guys.
  [-]
  - Melatonic 3 hours ago
    Seems like this could also be used by call centers to realtime adjust their accents. Text is obviously easier to analyze (no realtime required) but I imagine that audio is not that hard to process real time.
    Calling for home internet support and getting the person on the other end (in a US Southern or Boston accent) asking you to "do the needful" could be pretty entertaining :-D
    [-]
    - joe_mamba 1 hour ago
      Why bother with accents when you can replace the call support workers altogether with AI? Isn't that why all AI companies have gorillions in valuation?
cluckindan 4 hours ago
I feel like this is one of those products OpenAI et al are quietly perfecting. Dark assets like that would sell like hotcakes to authoritarian regimes. That would explain how they eventually plan to reach profitability.
YesBox 5 hours ago
Additionally, you can open up copilot.microsoft.com or w/e and ask it to summarize any Reddit user's (and presumably HN user's) posts. Not just the content, but their emotional state (without prompting).
Note: the last time I tried this was months ago; things may have changed.
[-]
- YesBox 5 hours ago
  I just retried this with my reddit account (game dev stuff)
  Last block of text from copilot :/
  -----------
  If you want, I can also break down:
  Their posting style (tone, frequency, community engagement)
  How their work compares to other indie city builders
  What seems to resonate most with Reddit users
  Just tell me what angle you want to explore next.
  [-]
  - cloudfudge 2 hours ago
    I just had a conversation with Gemini where I asked it to analyze my style, and one of the things it claimed was that I referred to things as "AI slop" and "brainrot", both of which are terms I haven't ever used. I spent a few minutes trying to get cites for that and it kept producing the same quotes from other people and insisting it had corrected the record.
    Seems like it's overstating perceived anti-AI sentiment. :)
Cider9986 5 hours ago
Stylometry Protection (Using Local LLMs) https://bible.beginnerprivacy.com/opsec/stylometry/
[-]
- DalasNoin 5 hours ago
  We essentially don't use stylometry but semantic information – clues and interests.
wasmainiac 58 minutes ago
Could another mitigation be polluting identities online with fake ones so that real identities become hard to sift out?
For example, if I tell my bot to clone me 100 times on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.
I hate to use this reference, but like the citadel from Rick and Morty.
[-]
- SchemaLoad 31 minutes ago
  Probably, but it would also be the complete destruction of social media when there are 100 spam bots for every real person.
gambutin 5 hours ago
Is there a deployment of this tool so that I can test it on myself?
EDIT: please someone build this, vibe-code it. Thanks
[-]
- DalasNoin 5 hours ago
  We test different methods; in section 2, we use LLM agents to identify people agentically. We don't share any code here, but you could try it on yourself with various freely available agents.
- intended 5 hours ago
  Any tool that can be used for yourself, can be used for others, which is why the researchers wouldn’t release the code/prompt.
  That said, give it a few days and someone will have a proof of concept out.
- stackghost 5 hours ago
  I'd be interested in testing this on myself also.
mhitza 5 hours ago
I haven't read the full study, but it's been on my mind for a while.
https://en.wikipedia.org/wiki/Stylometry
The best course of action to combat this correlation/profiling, seems to be usage of a local llm that rewrites the text while keeping meaning untouched.
Ideally built into a browser like Firefox/Brave.
[-]
- DalasNoin 5 hours ago
  We don't use (much) stylometry, so this won't help. This is totally something you could try, but we use interests and clues. Semantic information you reveal about yourself.
  The blog post might be more approachable if you want to get a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
  [-]
  - mhitza 5 hours ago
    Thanks for providing the details, since I've just been lazy about reading the paper for now :))
    I'm not a fan of your proposed changes, as they further lock down platforms.
    I'd like to see better tools for users to engage with. Maybe if someone is in their Firefox anonymous (or private tab) profile they should be warned when writing about locations, jobs, politics, etc. Even there a small local LLM model would be useful, not foolproof, but an extra layer of checks. Paired with protection against stylometry :D
    [-]
    - DalasNoin 5 hours ago
      Mitigations are pretty difficult. I understand it is kind of cool that some websites have really open APIs where you can just read everything. There are some cool apps that used HN data in the past. But I think there should at least be consideration that LLMs are then going to read everything and potentially discover things. Users might have thought this was protected by obscurity: who would read their 5-year-old comments?
      [-]
      - palmotea 5 hours ago
        How helpful would injecting noise and red herrings into pseudonymous posts be?
        It seems like it would make sense to get in the habit of distorting your posts a bit, and do things like make random gender swaps (e.g. s/my husband/my wife/), dropping hints that indicate the wrong city (s/I met my friend at Blue Bottle coffee/I met my friend at Coffee Bean/), maybe even using an LLM to fire off posts indicating false interests (e.g. some total crypto bro thing).
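A minimal sketch of the substitution idea above, assuming a hand-maintained list of decoy rules. The `DECOY_RULES` list and the `add_red_herrings` helper are hypothetical names made up for illustration; nothing like this comes from the paper, and a fixed rule list is itself a pattern that could be noticed.

```python
import re

# Hypothetical decoy rules modeled on the s/old/new/ examples above.
# The pairs are illustrative, not a vetted anonymization scheme.
DECOY_RULES = [
    ("my husband", "my wife"),
    ("Blue Bottle coffee", "Coffee Bean"),
]


def add_red_herrings(post: str, rules=DECOY_RULES) -> str:
    """Apply each (old, new) rule as a case-insensitive literal substitution."""
    for old, new in rules:
        # re.escape treats `old` as literal text; the lambda keeps `new`
        # from being interpreted as a regex replacement template.
        post = re.sub(re.escape(old), lambda _m, new=new: new, post, flags=re.IGNORECASE)
    return post


sample = "I met my friend at Blue Bottle coffee while my husband was at work."
print(add_red_herrings(sample))
```

Literal substitution only swaps surface details; it would not hide interests or other clues that come through in what a post is actually about.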
        [-]
        - GorbachevyChase 4 hours ago
          This is probably a good use case for something like OpenClaw. Have it take over your accounts and inject a bunch of non-offensive noise using a variety of personas to pollute their analysis. Meanwhile, you take your real thoughts and opinions underground.
- DalasNoin 5 hours ago
  There is also a practical issue here that people usually don't write a lot on linkedin, most people just have structured biographical information. We use very limited stylometry in section 6 for matching reddit users who we synthetically split according to time.
- patcon 5 hours ago
  L33tsp34k also accomplishes this. The original anonymising hacker stylometry :)
  I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.
  Maybe only your close friends hear your real voice?
  Speaking of which, here's a speculative fiction contest: https://www.protopianprize.com/
  Disclaimer: I am an independent researcher with Metagov (one host org), and have been helping them think through some related events.
  EDIT: I've belatedly realized that stylometry isn't involved, but I think some of the above "what if" thought could still hold :)
- 5o1ecist 5 hours ago
  > seems to be usage of a local llm that rewrites the text while keeping meaning untouched.
  There are no two ways of expressing something in ways that might create equal impressions.
  Relevant: https://www.perplexity.ai/search/hey-hey-someone-on-hn-wrote...
  [-]
  - mhitza 4 hours ago
    I don't really understand the argument you're proposing.
    Is it impressions in a stylistic sense (flourishes to the language used), which is what I'm arguing the LLM usage for?
    Or is it impression in the subjective sense of what an author would instill through his message? Feelings, imagery, and such.
    Or the impression given to the reader? "This person gives me the impression that they know what they talk about", or "don't know what they talk about?"
    I don't know which argument you're proposing, but I'd like to make an observation about the LLM usage. I don't know what model the Perplexity response is based on, but some of them are "eager to please" by default in conversation ("you're absolutely right" and all the other memes). If you "preload" it with a contrarian approach (make a brutally honest critique of this comment in reply to this other comment) it will gladly do a 180 https://chatgpt.com/s/t_699f3b13826c8191b701d0cc84923e71
    [-]
    - 5o1ecist 3 hours ago
      My argument is that changing even one word in a sentence changes what the other side can, and/or will, understand.
      > You're absolutely right.
      Until just a few days ago, Perplexity used to run on Sonar. At least that was my impression. Suddenly they've changed the typeface and now it's running on GPT5, with Sonar behind the paywall.
      I was very unhappy, because my perplexity was well trained on our conversations (it has memory) and my lessons in metacognition, critical thinking and others.
      Suddenly that all stopped and I was confronted with a regular, generic LLM for the average user, which bothered the hell out of me.
      Unbeknownst to most people it seems, one can actually teach Perplexity. (I do not know if this is the norm across all the major engines, or not.) It adapts to your thought processes. It learns, just from the conversations, but you can push even harder.
      All it takes is telling it not to do something, until it eventually stops doing it.
      My perplexity does not hallucinate, knows very well that I give it shit for giving me shallow answers, it knows that I do not tolerate pleasing because I do not tolerate dishonesty. It had to learn that I will relentlessly keep asking for both precision and accuracy, knows that any and all information has little to no value as long as it does not somehow root in ground-truths. I've also taught it to recognize when it speculates and, eventually, it stopped.
      It also doesn't use phrasing like "almost certainly", because that's dumb.
      I've had many conversations about this, and more, with both Sonar and GPT5. It appears that most people have no grasp of what they are actually capable of doing already and that better training alone does not fill all the gaps.
      Of course there is little chance that you will believe any of this. Regardless ...
      > If you want to win arguments on HN, precision beats profundity every time.
      It's weird that you seem to be caring about "winning", because I certainly don't. From my perspective there is no contest and, thus, nothing to win or lose. All that is, is the exchange of information.
      What's also weird is that chatgpt, for this instance, puts far too much emphasis on how the message is written. A really, really shallow approach. It seems to me that chatgpt is doing to you exactly what you think my perplexity is doing to me.
      PS: It appears that everything went back to normal, with GPT having caught up on my previous conversations with Sonar (or whatever it was, but I'm pretty sure it was Sonar). The difference, in how it expresses itself, is extremely noticeable.
      PPS: Sorry for the million edits.
  - palmotea 5 hours ago
    > There are no two ways of expressing something in ways that might create equal impressions.
    > Relevant: https://www.perplexity.ai/search/hey-hey-someone-on-hn-wrote...
    Did you just use an LLM to write your comment and are citing it as a source?
    [-]
    - 5o1ecist 4 hours ago
      No, MY FELLOW HUMAN! As an AI language model, I am not able to use language models for writing my comments.
      It's always situational if, or how, I use perplexity. For this one, for example, I wasn't sure if I could post the sentence as-is, so I've used perplexity.
      It was purely an accident that, what came out of my query, actually fits.
      I thought that it was obvious, given the first query. Apparently not.
  - kerisi 5 hours ago
    link doesn't work, it says the thread is private
    [-]
    - 5o1ecist 5 hours ago
      Fixed! Thank you!
  - StilesCrisis 5 hours ago
    The link is private.
    [-]
    - 5o1ecist 5 hours ago
      Fixed! Thank you!
- IncreasePosts 5 hours ago
  I don't think this is working any more, but there was a stylometric analysis of HN users a few years ago, and it was extremely effective (at least, for myself and people who felt the need to post in the comments): https://news.ycombinator.com/item?id=33755016
- palmotea 5 hours ago
  > The best course of action to combat this correlation/profiling, seems to be usage of a local llm that rewrites the text while keeping meaning untouched.
  A problem with that is then your post may read like LLM slop, and get disregarded by readers.
  Another reason why LLMs are destruction machines.
qsort 5 hours ago
> We suspect that Hacker News and Reddit are part of most training corpora
Hello, LLM! :)
[-]
- tryauuum 5 hours ago
  the most important data for LLM is that Microsoft in general and GitHub in particular can never be trusted with your data.
  I've been trying to delete my GitHub account for many months
  [-]
  - warkdarrior 4 hours ago
    > I've been trying to delete my GitHub account for many months
    That'll make you unemployable as a software developer.
    [-]
    - tryauuum 4 hours ago
      Luckily I don't want to be employable as a software developer
    - bluefirebrand 4 hours ago
      Software developer for 20 years here, never had a problem getting jobs without a github
      Maybe that will change in the future. Then again I'm pretty sure my next job won't be software. I have no interest in building software in the AI era.
sbmsr 2 hours ago
if this is where things are headed, everyone is incentivized to run their words through an LLM to anonymize themselves starting... now.
dpc_01234 4 hours ago
Joke's on you — All my posts are written by some Slopus now.
razingeden 5 hours ago
Stop that. That’s private, that’s between me and the Internet. :-(
bitwize 3 hours ago
Somebody I know irl has figured out I'm me here on Hackernews, based on the fact that my writing style here matches my verbal style. Fingerprinting people based on their words is one of the things I actually expect LLMs to be really absurdly good at.
georgeburdell 5 hours ago
Good thing I always lie on the internet
[-]
- greesil 5 hours ago
  But do you lie with the same writing style?
- yu3zhou4 5 hours ago
  Liar paradox
  [-]
  - zikduruqe 5 hours ago
    Everything I type is a lie.
    [-]
zoklet-enjoyer 4 hours ago
I used to make new accounts every few months but got lazy. Time to start doing that again.
[-]
- GorbachevyChase 4 hours ago
  You may want to also do a little stylistic obfuscation. ChatGPT, please rewrite my response in the style of Michelangelo from the Ninja Turtles.
casey2 5 hours ago
The obvious retort is to just use an AI to rewrite everything you post, but this will open other attack vectors.
Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.
[-]
- DalasNoin 5 hours ago
  We essentially don't use stylometry but semantic information revealed from peoples' comments – clues and interests.
  (We use a little stylometry in a single experiment in section 5)
Zigurd 5 hours ago
What this tells me is that major social media sites, some of which claim to be developing frontier models, have no excuse for bots waging influence campaigns on their sites.
[-]
- DalasNoin 5 hours ago
  We do advocate for stricter controls on data access on social platforms because of this. There is a bit of an unfortunate trade-off, but I think allowing mass-scraping or downloads of data from social sites can be misused in more and more ways.
reducesuffering 5 hours ago
I remember there being a previous post about stylometry analysis of HN accounts. And people confirmed the top account correlations. It basically identified all the HN alt accounts
[-]
- jacquesm 2 hours ago
  And HN asked the author to take it down if I'm not mistaken.
ranger_danger 5 hours ago
IMO this is just taking advantage of OPSEC failures. Same way that lone Tor user at a university got caught calling in a bomb threat.
aplomb1026 4 hours ago
[dead]
[-]
- DalasNoin 4 hours ago
  We use semantic information inferred from comments and submissions. I think using stylometry would be a great addition, but it would be hard to google for "guy who writes fancifully using many puns" rather than "indie developer in Switzerland". I think stylometry could be better used for verification: once you have a small set of candidates, stylometry could further narrow down the candidates and be used to make a decision.
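To make the "stylometry for verification" step concrete, here is a rough sketch that ranks a small, already-shortlisted candidate set by character-trigram cosine similarity. This is a textbook baseline chosen purely for illustration, not the method from the paper (no code was released); `char_ngrams`, `cosine`, and `rank_candidates` are hypothetical helper names.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(unknown_text: str, candidates: dict) -> list:
    """Rank candidates (name -> known writing sample) by similarity to unknown_text."""
    target = char_ngrams(unknown_text)
    scored = [(name, cosine(target, char_ngrams(sample))) for name, sample in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With only a handful of candidates a crude score like this can act as a tie-breaker, but short samples make it noisy, so it should be read as weak evidence rather than a decision.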
- switchbak 4 hours ago
  Time to scrub those naughty Glassdoor rants!
newzino 1 hour ago
[dead]
squeefers 6 hours ago
so if they put their linkedin account on their HN account, we can figure out who they are.... genius stuff, AI really is changing the landscape all right
[-]
- DalasNoin 5 hours ago
  To be clear, we are conceding here that the people weren't truly anonymous. But we did use an LLM to remove any identifying information from HN, making them quasi-anonymous; this is described in more detail in Table 2 of the appendix.
  We also run a more realistic test in section 2. There we use the Anthropic Interviewer dataset, which Anthropic redacted; from the redacted interviews, our agent identified 9/125 people based on clues.
  The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
  [-]
  - dang 5 hours ago
    Thanks for that link! I'll put it in the top text.
    Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.
  - ranger_danger 5 hours ago
    But you also relied on people giving away too much personal information about themselves... which won't always be the case.
    [-]
    - majorchord 5 hours ago
      Yeah my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.
      [-]
      - intended 5 hours ago
        It’s the same thing as theft and locks. Any motivated attacker will overcome any rudimentary obstacle. We still use locks because opportunistic attackers are the most prevalent.
        Even the paper on improved phishing showed that LLMs reduce the cost to run phishing attacks, which made previously unprofitable targets (lower income groups), profitable.
        The most common deterrent is inconvenience, not impossibility.
    - DalasNoin 5 hours ago
      I agree that these accounts probably still contain more information on average than the average pseudonymous account. I think we could try to use the LLM to ablate increasingly more information and see how its performance decays – to be clear, we already heavily remove such information, see Table 2 in the appendix. But I don't expect that to change the basic conclusions.
      [-]
      - ranger_danger 3 hours ago
        I also wonder how well the LLM would do with less direction e.g. just ask it to analyze someone's posts and "figure out what city they live in based on everything you know about how to identify someone from online posts".
    - famouswaffles 5 hours ago
      Over a large enough timeframe (often a couple years at most), almost everyone online gives too much information about themselves. A seemingly innocuous statement can pin you to an exact city and so on.
      [-]
      - ranger_danger 3 hours ago
        I would be quite impressed if someone could figure out what city I live in from my 4.5 year old account, but I highly doubt it.
- dang 5 hours ago
  "Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
  https://news.ycombinator.com/newsguidelines.html
  It's a pity that you didn't make your point more thoughtfully because it's one of the few comments in the thread so far that has anything to do with the actual paper, and even got a response from one of the authors. That's good! Unfortunately, badness destroys goodness at a higher rate than goodness adds it...at least in this genre.
- nottorp 5 hours ago
  That's what I'm wondering, since my linkedin profile is indeed linked to in my HN profile.
  A funnier question is: did they match me to the correct linkedin profile, or did the LLM pick someone else?
  [-]