Processing data breaches (especially big ones), can be extremely laborious. And, of course, everyone commenting on them is an expert, so there's a heap of opinions out there. And so it was with the latest stealer logs, a corpus of data that took the better part of a month to process. And then I made things confusing in various ways which led to both Disqus comment and ticket hell. But hey, it's finally out and now it's back to normal breach processing for the foreseeable future ๐
I like to start long blog posts with a tl;dr, so here it is:
We've ingested a corpus of 1.5TB worth of stealer logs known as "ALIEN TXTBASE" into Have I Been Pwned. They contain 23 billion rows with 493 million unique website and email address pairs, affecting 284M unique email addresses. We've also added 244M passwords we've never seen before to Pwned Passwords and updated the counts against another 199M that were already in there. Finally, we now have a way for domain owners to query their entire domain for stealer logs and for website operators to identify customers who have had their email addresses snared when entering them into the site. (Note: stealer logs are still freely and easily searchable by individuals, scroll to the bottom for a walkthrough.)
This work has been a month-long saga that began hot off the heels of processing the last massive stash of stealer logs in the middle of Jan. That was the first time we'd ever added more context to stealer logs by way of making the websites email addresses had been logged against searchable. To save me repeating it all here, if you're unfamiliar with stealer logs as a concept and what we've previously done with HIBP, start there.
Up to speed? Good, let's talk about ALIEN TXTBASE.
Origin Story
Last month after loading the aforementioned corpus of data, someone in a government agency reached out and pointed me in the direction of more data by way of two files totalling just over 5GB. Their file names respectively contained the numbers "703" and "704", the word "Alien" and the following text at the beginning of each file:
Pulling the threads, it turned out the Telegram channel referred to contained 744 files of which my contact had come across just the two. The data I'm writing about today is that full corpus, published to Telegram as individual files:
The file in the image above contained over 36 million rows of data consisting of website URLs and the email addresses and passwords entered into them. But the file is just a sample - a teaser - with more data available via the subscription options offered in the message. And that's the monetisation route: provide existing data for free, then offer a subscription to feed newly obtained logs to consuming criminals with a desire to exploit the victims again. Again? The stealer logs are obtained in the first place by exploiting the victim's machine, for example:
How do people end up in stealer logs? By doing dumb stuff like this: โAround October I downloaded a pirated version of Adobe AE and after that a trojan got into my pcโ pic.twitter.com/igEzOayCu6
So now this guy has malware running on his PC which is siphoning up all his credentials as they're entered into websites. It's those credentials that are then sold in the stealer logs and later used to access the victim's accounts, which is the second exploitation. Pirating software is just one way victims become infected; have a read of this recent case study from our Australian Signals Directorate:
When working from home, Alice remotely accesses the corporate network of her organisation using her personal laptop. Aliceย downloaded, onto herย personal laptop, a version of Notepad++ from a website she believed to be legitimate. Anย info stealerย was disguised as the installer for the Notepad++ software.
When Alice attempted to install the software, the info stealer activated and beganย harvesting user credentialsย from her laptop. This included her work username and password, which she had saved in her web browserโs saved logins feature. The info stealer then sent those user credentials to a remote command-and-control server controlled by a cybercriminal group.
Eventually, data like Alice's ends up in places like this Telegram channel and from there, it enables further crimes. From the same ASD article:
Stolen valid user credentials are highly valuable to cybercriminals, because they expedite the initial access to corporate networks and enterprise systems.
So, that's where the data has come from. As I said earlier, ALIEN TXTBASE is by no means the only Telegram channel out there, but it is definitely a major distribution channel.
Verification
When there's a breach of a discrete website, verification of the incident is usually pretty trivial. For example, if Netflix suffered a breach (and I have no indication they have, this is just an example), I can go to their website, head to the password reset field, enter a made-up email address and see a response like this:
On the other hand, an address that does exist on the service usually returns a message to the effect of "we've sent you a password reset email". This is called an "enumeration vector" in that it enables you to enumerate through a list of email addresses and find out which ones have an account on the site.
But stealer logs don't come from a single source like Netflix, instead they contain the credentials for a whole range of different sites visited by people infected with malware. However, I can still take lines from the stealer logs that were captured against Netflix and test the email addresses. (Note: I never test if the password is valid, that would be a privacy violation that constitutes unauthorised access and besides, as you'll read next, there's simply no need to.)
Initially, I actually ran into a bit of a roadblock when testing this:
I found this over and over again so, I went back and checked the source data and inspected this poor victim's record:
Their Netflix credentials were snared when they were entered into the website with a path of "/ph-en/login", implying they're likely Filipino. Let's try VPN'ing into Manilla:
And suddenly, a password reset gives me exactly what I need:
That's a little tangent from stealer logs, but Netflix obviously applies some geo-fencing logic to certain features. This actually worked out better than expected verification-wise because not only was I able to confirm the presence of the email address on their service, but that the stealer log entry placing them in the Philippines was also geographically correct. It was reproducible too: when I saw "something went wrong", but the path began with "mx", I VPN'd into Mexico City and Netflix happily confirmed the reset email was sent. Another path had "ve", so it was off to Caracas and the Venezuelan victim's account was confirmed. You get the idea. So, strong signal on confirmation of account existence via password reset, now let's also try something more personal.
I emailed a handful of HIBP subscribers and asked for their support verifying a breach. I got a fast, willing response from one guy and sent over more than 1,100 rows of data against his address ๐ฒ It's fascinating what you can tell about a person from stealer log data: He's obviously German based on the presence of websites with a .de address, and he uses all the usual stuff most of us do (Amazon, eBay, LinkedIn, Airbnb). But it's the less common ones that make for more interesting reading: He drives a Mercedes because he's been logging into an address there for owners, and it also appears he likes whisky given his account at an online specialist. He's a Firefox user, as he's logged in there too, and he seems to be a techie as he's logged into Seagate and a site selling some very specialised electrical testing equipment. But is it legit?
Imagine the heart-in-mouth moment the poor guy must have had seeing his digital life laid out in front of him like that and knowing criminals have this data. It'd be a hell of a shock.
Having said all that, whilst I'm confident there's a large volume of legitimate data in this corpus, it's also possible there will be junk. Fabricated email addresses, websites that were never used, etc. I hope folks who experience this can appreciate how hard it is for us to discern "legitimate" stealer logs from those that were made up. We've done as much parsing and validation as possible, but we have no way of knowing if someone@yourdomain.com is an email address that actually exists or if it does, if they ever actually used Netflix or Spotify or whatever. They're just strings of data, and all we can do is report them as found.
Searching Entries Against Your Website with a Pwned 5 Subscription
When we published the stealer logs last month, I kept caveating everything with "experimental". Even the first word of the blog post title was "experimenting", simply because we didn't know how this would be received. Would people welcome the additional data we were making available? Or find it unactionable noise? It turns out it was entirely the former, and I didn't see a single negative comment or, as it often has been in the past with stealer logs or malware breaches, angry victims demanding I send their entire row(s) of data. And I guess that makes sense given what we made available so, starting today, we're making even more available!
First, a bit of nomenclature. Consider the following stealer log entry:
There are four parts to this entry that are relevant to the HIBP services I'm going to write about here:
ย
ย
ย
Email Address
ย
ย
Website Domain
ย
Email Alias
ย
Email Domain
ย
https://
www.netflix.com
/en-ph/login
john
@
example.com
P@ssw0rd
Last month, we added functionality to the UI such that after verifying your email address you could see a collection of website domains. In the example above, this meant that John could use the free notification service to verify control of his email address after which he'd see www.netflix.com listed. (Note: we're presently totally redesigning this as part of our UX rebuild and it'll be much smoother in the very near future.) Likewise, we introduced an API to do exactly the same thing, but only after verifying control of the email domain. So, in the case above, the owner of example.com would be able to query john@example.com and get back www.netflix.com (along with any other website domains poor John had been using).
Today, we're introducing two new APIs, and they're big ones:
Query stealer logs by email domain
Query stealer logs by website domain
The first one is akin to our existing domain search feature so in the example above, the owner of the domain could query the stealer logs for example.com and get back each email address alias and the website domains they appear against. Here's what that output looks like:
The previous model only allowed querying by email address, so you could end up with an organisation needing to iterate through thousands of individual API requests. This model means that can now be done in a single request, which will make life much easier for larger organisations assessing the exposure of their workforce.
The second new API is designed for website operators who want to identify customers who've had their credentials snared at login. So, in our demo row above, Netflix could query www.netflix.com (after verifying control of the domain, of course) and retrieve a list of their customers from the stealer logs:
[
"john@example.com",
"jane@yahoo.com"
]
Both these new APIs are orientated towards larger organisations and can return vast volumes of data. When considering how to price this service, the simplest, most commensurate model we arrived at was to use a pricing tier we already had: Pwned 5:
Whilst we'd previously only ever listed tiers 1 through 4, we'd always had higher tiers sitting there in the background for organisations needing higher rate limits. Surfacing this subscription and adding the ability to query stealer logs via these two new APIs makes it easy for new and existing subscribers alike to unlock the feature. And if you are an existing subscriber, the price is simply adjusted pro rata at Stripe's end such that your existing balance is carried forward. Per the above image, this subscription is available either monthly or annually so if you just want to see what's in the current corpus of data and keep the cost down, take it for a month then cancel it. (Note: the Pwned 5 subscription is also now required for the API to search by email address we launched last month, but the web UI that uses the notification service to show stealer log results by email is absolutely still free and will remain that way.)
Another small addition since last month is that we've added an "IsStealerLog" flag on the breach model. This means that anyone programmatically dealing with data in HIBP can now easily elect to handle stealer logs differently than other breaches. For example, a new breach with this flag set to "true" might then trigger a query of the new API endpoints to search by domain so that an organisation can update their records with any new stealer log entries.
Anyone searching by email domain already knows the scope of addresses on their domain as it's reported on their dashboard. Plus, when email notifications are sent on breach load it tells you exactly how many new addresses from your domains are in the breach. Starting today, we've also added a column to explain how many email addresses appear against your website domain:
In other words, 3 people have had their email address grabbed by an info stealer when logging on to hibp-integration-tests.com, and the new API will return all of those addresses. It is only API-based for the moment, we'll consider if a UI makes sense as part of the rebranded site launch, it may not becuase of the potentially huge volumes of data.
Just one last thing: for the two new APIs that query by domain, we've set a rate limit which is entirely independent of the rate limit on, say, breached account searches. Whilst a Pwned 5 subscription would allow 1,000 requests to that API every minute, it's significantly more restricted when hitting those two new stealer log APIs. We haven't published a number as I expect we'll tweak it a bit based on usage, but it's more than enough for any normal use of the service whilst ensuring we don't get completely overwhelmed by high-overhead searches. The stealer log API that queries by email address inherits the 1,000 RPM rate limit of the Pwned 5 subscription.
We've Added 244M New Passwords to Pwned Passwords
One of the coolest most awesome best things we've ever done with HIBP is to create a massive repository of passwords that's all open source (both code and data) and can be queried anonymously without disclosing the searched password. Pwned Passwords is amazing, and it has gained huge traction:
There it is - weโve now passed 10,000,000,000 requests to Pwned Password in 30 days ๐ฎ This is made possible with @Cloudflareโs support, massively edge caching the data to make it super fast and highly available for everyone. pic.twitter.com/kw3C9gsHmB
10 billion times a month, our API helps a service somewhere assist one of their customers with making a good password choice. And that's just the API - we have no idea the full scope of the service as having the data open source means people can just download the entire corpus and run it themselves.
Per the opening para, we now have an additional 244 million previously unseen passwords in this corpus. And, as always, they make for some fun reading:
tender-kangaroo
swimmingkangaroo59
gentlekangaroo
CaptainKangaroo340
And, uh, some kangaroos doing other stuff I can't really repeat here. Those passwords are at the final stages of loading and should flush through cache to Cloudflare's hundreds of edge nodes in the next few hours. That's another quarter of a billion that join the list of previously breached ones, whilst 199 million we'd already seen before have had their prevalence counts upped.
HIBP in Practice
It's amazing to see where my little pet project with the stupid name has gone, and nobody is more surprised than me when it pops up in cool places. Looking around for some stealer log references while writing this blog post, I came across this one:
This was already in place when you created a new account or updated your password. But now it's also verified on every login against the live HIBP database. Hats off to the tremendous service HIBP provides to the internet ๐ https://t.co/Z61AgDaL2t
That's awesome! That's exactly the sort of use case that speaks to the motto of "do good things with breach data after bad things happen". By adding this latest trove of data, the folks using Basecamp will immediately benefit simply by virtue of the service being plugged into our API. And sidenote: David has done some amazing stuff in the past so I was especially excited to see this shout-out ๐
This one is a similar story, albeit using Pwned Passwords:
Their service is phenomenal! We also inform users in our product if they set/change their password to a known password that has been hacked. Admins have the option to not allow for users to use these passwords, if they wish. pic.twitter.com/bvLfYm9xzH
Inevitably, those requests form a slice of the 10 billion monthly we see that are now able to identify a quarter of a billion more compromised ones and hopefully, keep them out of harm's way.
For many organisations, the data we're making available via stealer logs is the missing piece of the puzzle that explains patterns that were previously unexplainable:
Gotta say Iโm pretty happy with what we did with stealer logs last week, think weโre gonna need to do more of this ๐ pic.twitter.com/4rMaMmL8LU
I've had so many emails to that effect after loading various stealer logs over the years. The constant theme I love hearing is how it benefits the good guys and, well, not so much the bad guys:
The introduction of these new APIs today will finally help many organisations identify the source of malicious activity and even more importantly, get ahead of it and block it before it does damange. Whilst there won't be any set cadence to the addition of more stealer logs (obviously, we can't really predict when this stuff will emerge), I have no doubt we'll continue to add a lot more data yet.
Techie Bits
Processing this data has been non-trivial to say the least, so I thought I'd give you a bit of an overview of how I ultimately approached it. In essence, we're talking about transforming a very large amount of highly redundant text-based data and ultimately ending up with a distinct list of email addresses against website domains. Here's the high-level process:
Start with 744 files of logs totalling 1.5TB and containing 23 billion rows
In a .NET console app, process each file in point 1 and extract the domain and email address from valid lines (domain, email and password, all colon delimited) to produce 744 files totalling 390GB (9.7B rows)
In another .NET console app, consolidate all 744 files from the previous point into a single file with a distinct set of website domain and email address pairs (789M rows)
Take the file from the previous point, and in another .NET console app, extract all the unique domains (18M rows)
Use SQL BCP to upload the files from the two previous points to SQL Azure
Insert any new domains that don't already exist in HIBP (these are held in a dedicated table) via a TSQL statement (6.7M rows)
Upload the 284M unique email addresses like with a typical data breach (69% of them were already in HIBP)
Join the distinct list of domains and email addresses to the data uploaded in the previous point and insert email and domain pairs into the stealer log table, but only if they haven't been seen before via another TSQL statement (493M rows)
Wait - if I took distinct website and email pairs in step 4 and got 789M rows then in step 9 it only inserted 493M, what happened? There were 220M rows already in HIBP from last month which will account for some of the gap (existing records aren't reinserted), and there was also some additional validation in SQL Server courtesy of code we only have in that environment. The remaining gap is explained by the .NET code not ignoring case on distinct, so in other words, I dumped and uploaded way more data than I had to and made SQL Sever do extra work ๐คฆโโ๏ธ
Go live and get ๐บ
It was actually much, much harder than this due to the trials and errors of attempting to distil what starts out as tens of billions of rows down into hundreds of millions of unique examples. I was initially doing a lot more of the processing in SQL Azure because hey, it's just cloud and you can turn up the performance as much as you want, right?! After running 80 cores of Azure SQL Hyperscale for days on end ($ouch$) and seeing no end in sight to the execution of queries intended to take distinct values across billions of rows, I fell back to local processing with .NET. You could, of course, use all sorts of other technologies of choice here, but it turned out that local processing with some hand-rolled code was way more efficient than delegating the task to a cloud-based RDBMS. The space used by the database tells a big part of the story:
As I've said many times before, the cloud, my friends, is not always the answer. Do as much processing as possible locally on sunk-cost hardware unless there's a compelling reason not to.
I've detailed all this here in part because that's what I've always done with this project over the last 11 and a bit years, but also to illustrate how much time, effort, and money is burned processing data like this. It's very non-trivial, especially not when everything has to ultimately go into an increasingly large system with loads of external dependencies and be fast, reliable and cost-effective.
Conclusion
From 23 billion raw rows down to a much more manageable and discrete set of data, the latest stealer logs are now searchable via all the ways you've done for years, plus those two new domain-based stealer log APIs. We've called this breach "ALIEN TXTBASE Stealer Logs", let's see what positive outcomes we can all now produce with the data.
Edit 1: Let me re-emphasise an important point from the blog post I think got a bit buried: The web UI that uses the notification service to show stealer log results by email is absolutely still free and will remain that way. If youโve got an email address in this breach and you want to see the stealer log domains against it, do this:
Fill in your email address and send yourself the verification email
Click the link emailed to you and scroll to the bottom of the page where you'll find a message similar to this:
We need to make this clearer as it's obviously confusing. Thanks for everyone's feedback, we're working on it.
Edit 2: I've temporarily turned off comments during the Aussie night to save my inbox from being bombarded while I sleep. Hopefully, you'll find what you're looking for in the existing comments, I'll open things back up in the morning my time.
Wait - it's Tuesday already?! When you listen to this week's (ok, last week's) video, you'll probably get the sense I was a bit overloaded. Yeah, so that didn't stop, and the stealer log processing and new feature building just absolutely swamped me. Plus, I spent from then until now in Sydney at various meetings and events which was great, but didn't do a lot for my productivity. Be that as it may, we're now less than 12 hours off launching this all so, in the interests of not having me stay up all night putting the finishing touches on it, let me drop here and come back in a few days to talk about how it's all been received ๐ค
We're now eyeball-deep into the HIBP rebrand and UX work, totally overhauling the image of the service as we know it. That said, a guiding principle has been to ensure the new looks is immediately recognisable and over months of work, I think we've achieved that. I'm holding off sharing anything until we're far enough down the road that we're confident in the direction we're heading, and then I want to invite the masses to contribute as we head towards a (re)launch.
Whilst I didn't talk about it in this week's video, let me just recap on why we're doing this: the decisions made for a pet project nearly 12 years ago now are very different to the decisions made for a mainstream service with so many dependencies on it today. We're at a point where we need more professionalism and cohesion and that's across everything from the website design and content, the branding on our formal documentation, the stickers I hand out all over the place, the swag we want to make and even the signatures on our emails. Our task is to keep the heart and soul of a humble community-first project whilst simultaneously making sure it actually looks like we know what we're doing ๐
It's IoT time! We're embarking on a very major home project (more detail of which is in the video), and some pretty big decisions need to be made about a very simple device: the light switch. I love having just about every light in our connected... when it works. The house has just the right light early each morning, it transitions into daytime mode right at the perfect time based on the amount of solar radiation in the sky, into evening time courtesy of the same device and then blacks out when we go to bed. And some lights come on with movement based on motion sensors in fans (Big Ass fans have occupancy sensors), cameras (Ubiquiti camera raise motion events), and tiny dedicated Zigbee sensors. But getting the right physical switches in combination with the right IoT relays has been a bit more challenging. Listen to this week's show let me know if you have any "bright" ideas ๐
We're heading back to London! And making a trip to Reykjavik. And Dublin. I talked about us considering this in the video yesterday, and just before publishing this post, we pulled the trigger and booked the tickets. The plan is to pretty much repeat the US and Canada trip we did in September and spend the time meeting up with some of the law enforcement agencies and various other organisations we've been working with over the years. As I say in the video, if you're in one of these locations and are in a position to stand up a meetup or user group session, I'd love to hear from you. Europe is a hell of a long way to go so we do want to make the most of the travel, stand by for more plans as they emerge.
If I'm honest, I was in two minds about adding additional stealer logs to HIBP. Even with the new feature to include the domains an email address appears against in the logs, my concern was that I'd get a barrage of "that's useless information" messages like I normally do when I load stealer logs! Instead, the feedback was resoundingly positive. This week I'm talking more about the logic behind this, some of the challenges we faced with it and what we might see in the future. Stay tuned, because I think we're going to be seeing a lot more of this in HIBP.
It sounds easy - "just verify people's age before they access the service" - but whether we're talking about porn in the US or Australia's incoming social media laws, the reality is way more complex than that. There's no unified approach across jurisdictions and even within a single country like Australia, the closest we've got to that is a government scheme usually intended for accessing public services. And even if there was a technically workable model, who wants to get either the gov or some big tech firm involved in their use of Instagram or Pornhub?! There's a social acceptance to be considered and not only that, circumvention of age controls is very easy when you can simply VPN into another jurisdiction and access the same website blocked in your locale. Or in the case of the adult material, I'm told (๐คทโโ๏ธ) there are many other legally operating websites in other parts of the world that are less inclined to block individuals in specific states from foreign countries. There'll be no easy solutions for this one, but it'll make for an entertaining year ๐
The Glamira data breach made it into HIBP (link through to a Reddit thread where the company acknowledged the breach last year, no word on whether they disclosed to impacted individuals)
There's a certain irony to the Bluesky situation where people are pushing back when I include links to X. Now, where have we seen this sort of behaviour before? ๐ค When I'm relying on content that only appears on that platform to add context to a data breach in HIBP and that content is freely accessible from within the native Bluesky app (without needing an X account), we're out of reasonable excuses for the negativity. And if "because Elon" is the sole reason and someone is firm enough in their convictions on that, there's a very easy solution ๐
I fell waaay behind the normal video cadence this week, and I couldn't care less ๐ I mean c'mon, would you rather be working or sitting here looking at this view after snowboarding through Christmas?!
Christmas Day awesomeness in Norway ๐ณ๐ด Have a great one friends, wherever you are ๐งโ๐ pic.twitter.com/F2FtcJYzRC
That said, Scott and I did carve out some time to chat about the, uh, "colourful" feedback he's had after finally putting a price on some Report URI features he'd been giving away free for years. And there's more data breaches, of course, including a couple I loaded over the previous week that I think were particularly interesting. Enjoy this week's video, next week's will be a 2024 wrap-up from somewhere much, much sunnier ๐
I'm back in Oslo! Writing this the day after recording, it feels like I couldn't be further from Dubai; the temperature starts with a minus, it's snowing and there's not a supercar in sight.
Back on business, this week I'm talking about the challenge of loading breaches and managing costs. A breach load immediately takes us from a very high percentage cache hit ratio on Cloudflare to zero. Consequently, our SQL costs skyrocket as the DB scales to support the load. Approximately 28 hours after loading the two breaches I mention in this week's update, we're still running a DB scale that's 350% larger than once we have a high cache hit ratio, and that directly hits my wallet. We need to work on this more because as I say in the video, I really don't like financial incentives that influence how breaches are handled, such as delaying them and bulking them together to reduce the impact of cache flush events like this. We'll give that more thought, I think there are a few ways to tackle this. For now, here's this week's video and some of the challenges we're facing:
Some people really don't like supercars (although I suspect it's more about not liking to see either the enjoyment others take in them or the success they may have achieved)
I think what's really scratching an itch for me with the home theatre thing is that it's this whole geeky world of stuff that I always knew was out there, but I'd just never really understood. For example, I mentioned waveforming in the video, and I'd never even heard of that let alone understood that there may be science where sound waves are smashed into each other in opposing directions in order to cancel each other out. And I'm sure I've got that completely wrong, but that's what's so fun about this! Anyway, that's all just part of the next adventure, and I hope you enjoy hearing about it and sending over your thoughts because I'm pretty sure there's a gazillion things I don't know yet ๐
It's hard to find a good criminal these days. I mean a really trustworthy one you can be confident won't lead you up the garden path with false promises of data breaches. Like this guy yesterday:
For my international friends, JB Hi-Fi is a massive electronics retailer down under and they have my data! I mean by design because I've bought a bunch of stuff from them, so I was curious not just about my own data but because a breach of 12 million plus people would be massive in a country of not much more than double that. So, I dropped the guy a message and asked if he'd be willing to help me verify the incident by sharing my own record. I didn't want to post any public commentary about this incident until I had a reasonable degree of confidence it was legit, not given how much impact it could have in my very own backyard.
Now, I wouldn't normally share a private conversation with another party, but when someone sets out to scam people, that rule goes out the window as far as I'm concerned. So here's where the conversation got interesting:
He guaranteed it for me! Sounds legit. But hey, everyone gets the benefit of the doubt until proven otherwise, so I started looking at the data. It turns out my own info wasn't in the full set, but he was happy to provide a few thousand sample records with 14 columns:
customer_id_
first_name
last_name
FullName
gender
email_address_
mobile_country_
mobile_number_
dob
postal_street_1_
state_
postal_code_
city_
account_status
Pretty standard stuff, could be legit, let's check. I have a little Powershell script I run against the HIBP API when a new alleged breach comes in and I want to get a really good sense of how unique it is. It simply loops through all the email addresses in a file, checks which breaches they've been in and keeps track of the percentage that have been seen before. A unique breach will have anywhere from about 40% to 80% previously seen addresses, but this one had, well, more:
Spot the trend? Every single address has one breach in common. Hmmm... wonder what the guy has to say about that?
But he was in the server! And he grabbed it from the dashboard of Shopify! Must be legit, unless... what if I compared it to the actual full breach of Dymocks? That's a local Aussie bookseller (so it would have a lot of Aussie-looking email addresses in it, just like JB Hi-Fi would), and their breach dated back to mid-2023. I keep breaches like that on hand for just such occasions, let's compare the two:
Wow! What are the chances?! He's going to be so interested when he hears about this!
And that was it. The chat went silent and very shortly after, the listing was gone:
I wrote this short post to highlight how important verification of data breach claims is. Obviously, I've seen loads of legitimate ones but I've also seen a lot of rubbish. Not usually this blatant where the party contacting me is making such demonstrably false claims about their own exploits, but very regularly from people who obtain something from another party and repeat the lie they've been told. This example also highlights how useful data from previous breaches is, even after the email addresses have been extracted and loaded into HIBP. Data is so often recycled and shipped around as something new, this was just a textbook perfect case of making use of a previous incident to disprove a new claim. Plus, it's kinda fun poking holes in a scamming criminal's claims ๐
TL;DR โย Email addresses in stealer logs can now be queried in HIBP to discover which websites they've had credentials exposed against. Individuals can see this by verifying their address using the notification service and organisations monitoring domains can pull a list back via a new API.
Nasty stuff, stealer logs. I've written about them and loaded them into Have I Been Pwned (HIBP) before but just as a recap, we're talking about the logs created by malware running on infected machines. You know that game cheat you downloaded? Or that crack for the pirated software product? Or the video of your colleague doing something that sounded crazy but you thought you'd better download and run that executable program showing it just to be sure? That's just a few different ways you end up with malware on your machine that then watches what you're doing and logs it, just like this:
These logs all came from the same person and each time the poor bloke visited a website and logged in, the malware snared the URL, his email address and his password. It's akin to a criminal looking over his shoulder and writing down the credentials for every service he's using, except rather than it being one shoulder-surfing bad guy, it's somewhat larger than that. We're talking about billions of records of stealer logs floating around, often published via Telegram where they're easily accessible to the masses. Check out Bitsight's piece titled Exfiltration over Telegram Bots: Skidding Infostealer Logs if you'd like to get into the weeds of how and why this happens. Or, for a really quick snapshot, here's an example that popped up on Telegram as I was writing this post:
As it relates to HIBP, stealer logs have always presented a bit of a paradox: they contain huge troves of personal information that by any reasonable measure constitute a data breach that victims would like to know about, but then what can they actually do about it? What are the websites listed against their email address? And what password was used? Reading the comments from the blog post in the first para, you can sense the frustration; people want more info and merely saying "your email address appeared in stealer logs" has left many feeling more frustrated than informed. I've been giving that a lot of thought over recent months and today, we're going to take a big step towards addressing that concern:
The domains an email address appears next to in stealer logs can now be returned to authorised users.
This means the guy with the Gmail address from the screen grab above can now see that his address has appeared against Amazon, Facebook and H&R Block. Further, his password is also searchable in Pwned Passwords so every piece of info we have from the stealer log is now accessible to him. Let me explain the mechanics of this:
Firstly, the volumes of data we're talking about are immense. In the case of the most recent corpus of data I was sent, there are hundreds of text files with well over 100GB of data and billions of rows. Filtering it all down, we ended up with 220 million unique rows of email address and domain pairs covering 69 million of the total 71 million email addresses in the data. The gap is explained by a combination of email addresses that appeared against invalidly formed domains and in some cases, addresses that only appeared with a password and not a domain. Criminals aren't exactly renowned for dumping perfectly formed data sets we can seamlessly work with, and I hope folks that fall into that few percent gap understand this limitation.
So, we now have 220 million records of email addresses against domains, how do we surface that information? Keeping in mind that "experimental" caveat in the title, the first decision we made is that it should only be accessible to the following parties:
The person who owns the email address
The company that owns the domain the email address is on
At face value it might look like that first point deviates from the current model of just entering an email address on the front page of the site and getting back a result (and there are very good reasons why the service works this way). There are some important differences though, the first of which is that whilst your classic email address search on HIBP returns verified breaches of specific services, stealer logs contain a list of services that have never have been breached. It means we're talking about much larger numbers that build up far richer profiles; instead of a few breached services someone used, we're talking about potentially hundreds of them. Secondly, many of the services that appear next to email addresses in the stealer logs are precisely the sort of thing we flag as sensitive and hide from public view. There's a heap of Pornhub. There are health-related services. Religious one. Political websites. There are a lot of services there that merely by association constitute sensitive information, and we just don't want to take the risk of showing that info to the masses.
The second point means that companies doing domain searches (for which they already need to prove control of the domain), can pull back the list of the websites people in their organisation have email addresses next to. When the company controls the domain, they also control the email addresses on that domain and by extension, have the technical ability to view messages sent to their mailbox. Whether they have policies prohibiting this is a different story but remember, your work email address is your work's email address! They can already see the services sending emails to their people, and in the case of stealer logs, this is likely to be enormously useful information as it relates to protecting the organisation. I ran a few big names through the data, and even I was shocked at the prevalence of corporate email addresses against services you wouldn't expect to be used in the workplace (then again, using the corp email address in places you definitely shouldn't be isn't exactly anything new). That in itself is an issue, then there's the question of whether these logs came from an infected corporate machine or from someone entering their work email address into their personal device.
I started thinking more about what you can learn about an organisation's exposure in these logs, so I grabbed a well-known brand in the Fortune 500. Here are some of the highlights:
2,850 unique corporate email addresses in the stealer logs
3,159 instances of an address against a service they use, accompanied by a password (some email addresses appeared multiple times)
The top domains included paypal.com, netflix.com, amazon.com and facebook.com (likely within the scope of acceptable corporate use)
The top domains also included steamcommunity.com, roblox.com and battle.net (all gaming websites likely not within scope of acceptable use)
Dozens of domains containing the words "porn", "adult" or "xxx" (definitely not within scope!)
Dozens more domains containing the corporate brand, either as subdomains of their primary domain or org-specific subdomains of other services including Udemy (online learning), Amplify ("strategy execution platform"), Microsoft Azure (the same cloud platform that HIBP runs on) and Salesforce (needs no introduction)
That said, let me emphasise a critical point:
This data is prepared and sold by criminals who provide zero guarantees as to its accuracy. The only guarantee is that the presence of an email address next to a domain is precisely what's in the stealer log; the owner of the address may never have actually visited the indicated website.
Stealer logs are not like typical data breaches where it's a discrete incident leading to the dumping of customers of a specific service. I know that the presence of my personal email address in the LinkedIn and Dropbox data breaches, for example, is a near-ironclad indication that those services exposed my data. Stealer logs don't provide that guarantee, so please understand this when reviewing the data.
The way we've decided to implement these two use cases differs:
Individuals who can verify they control their email address can use the free notification service. This is already how people can view sensitive data breaches against their address.
Organisations monitoring domains can call a new API by email address. They'll need to have verified control of the domain the address is on and have an appropriately sized subscription (essentially what's already required to search the domain).
Because of the recirculation of many stealer logs, we're not tracking which domains appeared against which breaches in HIBP. Depending on how this experiment with stealer logs goes, we'll likely add more in the future (and fill in the domain data for existing stealer logs in HIBP), but additional domains will only appear in the screen above if they haven't already been seen.
We've done the searches by domain owners via API as we're talking about potentially huge volumes of data that really don't scale well to the browser experience. Imagine a company with tens or hundreds of thousands of breached addresses and then a whole heap of those addresses have a bunch of stealer log entries against them. Further, by putting this behind a per-email address API rather than automatically showing it on domain search means it's easy for an org to not see these results, which I suspect some will elect to do for privacy reasons. The API approach was easiest while we explore this service then we can build on that based on feedback. I mentioned this was experimental, right? For now, it looks like this:
Lastly, there's another opportunity altogether that loading stealer logs in this fashion opens up, and the penny dropped when I loaded that last one mentioned earlier. I was contacted by a couple of different organisations that explained how around the time the data I'd loaded was circulating, they were seeing an uptick in account takeovers "and the attackers were getting the password right first go every time!" Using HIBP to try and understand where impacted customers might have been exposed, they posited that it was possible the same stealer logs I had were being used by criminals to extract every account that had logged onto their service. So, we started delving into the data and sure enough, all the other email addresses against their domain aligned with customers who were suffering from account takeover. We now have that data in HIBP, and it would be technically feasible to provide this to domain owners so that they can get an early heads up on which of their customers they probably have to rotate credentials for. I love the idea as it's a great preventative measure, perhaps that will be our next experiment.
I thought that doing more than 10 billion requests a month was cool, but look at that data transfer - more than a quarter of a petabyte just last month! And it's in use at some pretty big name sites as well:
That's just where the API is implemented client-side, and we can identify the source of the requests via the referrer header. Most implementations are done server-side, and by design, we have absolutely no idea who those folks are. Shoutout to Cloudflare while we're here for continuing to provide the service behind this for free to help make a more secure web.
In terms of the passwords in this latest stealer log corpus, we found 167 million unique ones of which only 61 million were already in HIBP. That's a massive number, so we did some checks, and whilst there's always a bit of junk in these data sets (remember - criminals and formatting!) there's also a heap of new stuff. For example:
Tryingtogetkangaroo
Kangaroolover69
fuckkangaroos
And about 106M other non-kangaroo themed passwords. Admittedly, we did start to get a bit preoccupied looking at some of the creative ways people were creating previously unseen passwords:
passwordtoavoidpwned13
verygoodpassword
AVerryGoodPasswordThatNooneCanGuess2.0
And here's something especially ironic: check out these stealer log entries:
People have been checking these passwords on HIBP's service whilst infected with malware that logged the search! None of those passwords were in HIBP... but they all are now ๐
Want to see something equally ironic? People using my Hack Yourself First website to learn about secure coding practices have also been infected with malware and ended up in stealer logs:
So, that's the experiment we're trying with stealer logs, and that's how to see the websites exposed against an email address. Just one final comment as it comes up every single time we load data like this:
We cannot manually provide data on a per-individual basis.
Hopefully, there's less need to now given the new feature outlined above, and I hope the massive burden of looking up individual records when there are 71 million people impacted is evident. Do leave your comments below and help us improve this feature to become as useful as we can possibly make it.
This week I'm giving a little teaser as to what's coming with stealer logs in HIBP and in about 24 hours from the time of writing, you'll be able to see the whole thing in action. This has been a huge amount of work trawling through vast volumes of data and trying to make it usable by the masses, but I think what we're launchung tomorrow will be awesome. Along with a new feature around these stealer logs, we've also added a huge number of new passwords to Pwned Passwords not previously seen before. Now, for the first time ever, "fuckkangaroos" will be flagged by any websites using the service ๐ฎ More awesome examples coming in tomorrow's blog post, stay tuned!
A super quick intro today as I rush off to do the next very Dubai thing: drive a Lambo through the desert to go dirt bike riding before jumping in a Can-Am off-roader and then heading to the kart track for a couple of afternoon sessions. I post lots of pics to my Facebook account, and if none of that is interesting, here's this week's video on more infosec-related topics:
I wouldn't say this is a list of my favourite breaches from this year as that's a bit of a disingenuous term, but oh boy were there some memorable ones. So many of the incidents I deal with are relatively benign in terms of either the data they expose or the nature of the service, but some of them this year were absolute zingers. This week, I'm talking about the ones that really stuck out to me for one reason or another, here's the top 5: