You need access to their data to process it, so any layer of indirection (like a database they control) is additional complexity without meaningful benefit. For clients with strict data control requirements, self-hosting of the whole system is the standard solution (with a very high licensing fee).
Something to keep in mind is that some clients are not operating in good faith: their goal isn't to work together to find a solution but to present roadblocks. The reasoning can be complicated: perhaps there's internal politics around which solution to use, or perhaps your solution is receiving pushback because it's not the preferred solution of one stakeholder. You'll probably never know the true motivations, so it's important not to get caught up in engineering a solution to a problem that doesn't really exist.
You've mentioned that the data you need access to is code: GitHub is a perfect comparable. GitHub's cloud service is used by the majority of companies with code; in fact, I'd guess even your clients are using GitHub's hosted services. If the problem is that your company doesn't have the reputation necessary to give these clients confidence that you can securely manage their code, that may just be a sign that, right now, these clients aren't the right fit for you, and you should work with less antsy clients until you have built up the credibility.
> their goal isn't to work together to find a solution but to present roadblocks. The reasoning can be complicated..
Or as simple as “the less I appear to value this solution, the lower the supplier will estimate my maximum price for it”
That is very valid. My problem is that a large portion of my possible clients seemed to be happy with the idea of the solution I provided. I was looking into it tech-wise because I had somewhat validated it for my current client-space.
Self-hosting seems like the most reliable option for the time being (or executing functions on the encrypted data without decrypting it). However, is it standard practice to use Kubernetes to give them a preconfigured database that they can deploy on my own cloud? I wouldn't access the code except temporarily, through a little script that talks to my cloud and comes along with the database in the pod that they "self host." Would that be considered standard practice?
No, that wouldn't be considered standard practice. Fundamentally, if you are able to control the code that executes then you can exfiltrate the data regardless of how it is stored. The reason self-hosting is a secure way to execute code against data is because it removes the code from your control: with self-hosting, you would give your code over to the client and then they would run it in their environment.
Providing your customers with their own database in your environment is a method for segregating their data and ensuring that there's no unintentional co-mingling of their data with other customers (which is a common problem in a multi-tenant environment) but it does not protect the customer data from being accessed by you: if code you are executing can access the data, then you can access the data.
Reading between the lines ("a large portion of my possible clients seemed to be happy with the idea of the solution I provided") it sounds like my initial understanding of the situation was incorrect: I thought that you had been asked to build this specific architecture by your clients but it sounds like it's the opposite: you've had an idea, come up with an architecture and then validated that idea with potential clients by describing the architecture? Is that correct?
If that's the actual situation, I think this is a much simpler problem to solve. Architecture is architecture, it isn't a part of the solution, it's a means to an end. There are a very small number of clients who may have strict security/compliance requirements that do necessitate this sort of complexity (which is where self-hosting comes in) but for the majority of clients, how the product works is immaterial, they care only about the results.
Realising that you've made a terrible mistake when building a system using the architecture you designed 6 months ago is a rite of passage, it is the process: every vision you have today for how your system will work is probably going to be wrong 6 months from now. That's completely normal, you will learn more about how your system should work in 1 month of building than you would in 6 months of planning.
Try to take a step back from thinking about architecture. One of the biggest dangers when working on an early stage technology product is committing yourself to a technical direction that then dictates the product direction. If, for example, you decide today to build a system in which clients self-host the database that your code accesses, and then you decide you want to build a feature that requires 10x as many queries to the database, oops, you can't build that, because it would require your clients to upgrade their self-hosted database resources, and getting them to do that will be all but impossible.
If you want to share more about your idea, I can outline some ideas about how I might approach building it in a cheap way that allows for validating the idea. There are exceptions, but nowadays, given the maturity of the software development space, most ideas can be built and launched to validate with real customers in 1 month. If your vision for how you'll build something requires 3, 6 or 12 months to get customers using it, it's probably overcomplicated.
Two options I’ve seen:
Customer Managed Keys - You have everything encrypted in your database via a key the customer has. You request (likely automated) that key every time you process the data. They can revoke at any point, and have an audit log of every access.
Self Hosting - Let the customer host your solution themselves or automate spinning up a cloud environment for them that they have full control over.
Both are kind of a pain to implement, but that lets you charge more for these enterprise features.
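To make the customer-managed-key option a bit more concrete, here is a rough sketch of the flow: the customer's side holds a master key, the vendor only ever stores wrapped data keys and ciphertext, and every unwrap goes through the customer, who logs it and can revoke it. All names here (`CustomerKeyService`, `VendorStore`, `_keystream_xor`) are made up for illustration, and the cipher is a toy SHA-256 keystream standing in for a real AEAD such as AES-GCM so the example runs on the standard library alone. In practice you would use a managed KMS and a vetted crypto library.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (stand-in for AES-GCM). NOT suitable for production."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class CustomerKeyService:
    """Hypothetical service on the customer's side: owns the master key."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)
        self.revoked = False
        self.audit_log: list[tuple[str, str]] = []  # (caller, action)

    def _check(self, caller: str, action: str) -> None:
        self.audit_log.append((caller, action))  # denied attempts are logged too
        if self.revoked:
            raise PermissionError("customer has revoked key access")

    def new_data_key(self, caller: str) -> tuple[bytes, bytes]:
        """Return (plaintext data key, wrapped data key)."""
        self._check(caller, "new_data_key")
        data_key = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        return data_key, nonce + _keystream_xor(self._master_key, nonce, data_key)

    def unwrap(self, caller: str, wrapped: bytes) -> bytes:
        self._check(caller, "unwrap")
        return _keystream_xor(self._master_key, wrapped[:16], wrapped[16:])


class VendorStore:
    """Hypothetical vendor database: stores only wrapped keys and ciphertext."""

    def __init__(self, keys: CustomerKeyService) -> None:
        self.keys = keys
        self.rows: dict[str, tuple[bytes, bytes, bytes]] = {}

    def put(self, name: str, plaintext: bytes) -> None:
        data_key, wrapped = self.keys.new_data_key("vendor")
        nonce = secrets.token_bytes(16)
        self.rows[name] = (wrapped, nonce, _keystream_xor(data_key, nonce, plaintext))

    def count_lines(self, name: str) -> int:
        """Example processing step: needs the customer to unwrap the key first."""
        wrapped, nonce, ciphertext = self.rows[name]
        data_key = self.keys.unwrap("vendor", wrapped)
        return len(_keystream_xor(data_key, nonce, ciphertext).splitlines())
```

Note that this only limits when the vendor can decrypt. While a key is unwrapped, the vendor's code still sees the plaintext, which is the point made elsewhere in the thread.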
I see, I heard about "fully homomorphic encryption" which is faster to implement and allows you to run code on encrypted data but the time complexity is O((10^6) * n) which is insane.
Confidential Computing also provides data-in-use protection and has a significantly more realistic overhead, often <10% in real-world workloads I've seen. However, in this case you might want to combine it with customer managed keys (BYOK) or self-hosting anyways - otherwise the customer has no opportunity to perform remote attestation and prove you're really running in Confidential Computing.
The visualization about halfway down https://www.anjuna.io/solution/secure-ai (my employer) is an example of the self-hosted flavor of this. Happy to discuss deeper, my contact info is in my bio.
Do they hate that it's unencrypted in the DB, or that the DB's storage itself is unencrypted?
(for my business, anyway) I've found this wording to be enough for bigger customers:
Data is stored on AWS RDS, encrypted at rest by an industry standard AES-256 encryption algorithm (more on that here: https://aws.amazon.com/rds/features/security/)
My main problem is that I need to do operations on the data while it's in the DB. This means that I cannot leave it encrypted end-to-end there.
When RDS is encrypted at rest, it means that the data stored in the database is encrypted while it resides on disk. This means that the data is protected against unauthorised access to raw storage.
The data accessed by the app is not encrypted, you can still work on the data as you would usually do. It's mostly a compliance thing. Not sure what level of security it _actually_ brings to the data itself, but most companies are okay with "encryption at rest".
Encryption at rest is meant to protect data when the storage device is stolen or lost.
Sure you can. You just can’t do zero knowledge encryption.
How is that possible?
Confidential Computing is a way in which cloud providers let their customers encrypt data “in-use” - that might be what you’re looking for.
Sounds like it's exactly what I need. Thank you!
Yeah, exactly this. Especially if you need to programmatically process that data too. You can even let the customers provide their own managed key (such as an externally managed AWS KMS key) in combination with something like AWS Nitro Enclaves.
I’ve enjoyed building on Nitro myself and most things should run in it just fine; you just need to build the networking vsock proxy into the Nitro image for anything that needs networking (such as the DB, where you store the encrypted-at-rest data).
Are you using one database per customer or a shared database (with an additional key on the tables)?
Because enterprise clients are going to want their own database, which has its own licensing and operating costs that you should be building into your price. And since they will have their own database, it can be encrypted with a key that is unique to them.
For small business customers, a shared database is the only way to stay profitable.
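A small sketch of the two layouts, using in-memory SQLite purely as a stand-in for whatever database you actually run. The table and helper names (`notes`, `notes_for`, `PerTenantDatabases`) are hypothetical. In the shared layout, isolation depends on every query carrying the tenant filter; in the per-tenant layout, a query simply has no way to reach another customer's rows.

```python
import sqlite3


def shared_db() -> sqlite3.Connection:
    """Shared layout: one database, every row tagged with a tenant key."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE notes (tenant_id TEXT NOT NULL, body TEXT NOT NULL)")
    return db


def notes_for(db: sqlite3.Connection, tenant_id: str) -> list[str]:
    # Forgetting this WHERE clause anywhere in the codebase co-mingles tenants.
    rows = db.execute(
        "SELECT body FROM notes WHERE tenant_id = ? ORDER BY rowid", (tenant_id,)
    )
    return [row[0] for row in rows]


class PerTenantDatabases:
    """Per-tenant layout: each customer gets a separate database."""

    def __init__(self) -> None:
        self._dbs: dict[str, sqlite3.Connection] = {}

    def for_tenant(self, tenant_id: str) -> sqlite3.Connection:
        if tenant_id not in self._dbs:
            db = sqlite3.connect(":memory:")  # separate database per connection
            db.execute("CREATE TABLE notes (body TEXT NOT NULL)")
            self._dbs[tenant_id] = db
        return self._dbs[tenant_id]
```

Either way, this is segregation between customers, not protection from the operator: whoever runs the code can open any of these databases.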
Disclaimer: I work for Snowflake.
This idea (customer owns the data, code is deployed next to the data, data never leaves customer perimeter) is the exact use case for the native application framework:
https://docs.snowflake.com/en/developer-guide/native-apps/na...
I lead an open source nonprofit which deploys things like this. Feel free to shoot me a DM on Twitter. Handle is @iamtrask
Why do they hate the idea?
It’s not clear what the core problem is. Are they contractually or by law obligated to comply with security/privacy requirements? Are they afraid you’ll misuse their data (steal their business, etc.)?
If you can be explicit about what “hate” means, you can find a solution, or decide this is not a potential customer.
They are not comfortable with the fact that I can look at their code base whenever I want.
So we recently had to do something like this for PCI DSS certification. The database is encrypted at rest (AWS RDS), but the data is presented as clear text to any DBA. The solution we came up with was to add field-level encryption to certain Card Holder Data (CHD) fields like Account etc. To do this, we use AWS KMS to encrypt/decrypt the data, and then we only grant the rights to use this key to an IAM Role that the database holds and explicitly prevent any Admin accounts from accessing it. End result is that Admins can manage the database, but can't see all of it in the clear.
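For anyone wondering what "only grant the key to one role and deny admins" might look like, here is a guess at the general shape of such a KMS key policy, built as a Python dict. This is not the actual policy from the setup above: the account ID, role names, and the helper `field_key_policy` are all placeholders, and a real policy would need reviewing against your own IAM layout.

```python
import json


def field_key_policy(account_id: str, app_role: str, key_admin_role: str) -> dict:
    """Sketch of a key policy: admins manage the key, only the app role decrypts."""
    app_arn = f"arn:aws:iam::{account_id}:role/{app_role}"
    admin_arn = f"arn:aws:iam::{account_id}:role/{key_admin_role}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Key administrators can manage the key but get no crypto actions.
                "Sid": "AdminsManageButDoNotUse",
                "Effect": "Allow",
                "Principal": {"AWS": admin_arn},
                "Action": [
                    "kms:DescribeKey",
                    "kms:EnableKey",
                    "kms:DisableKey",
                    "kms:GetKeyPolicy",
                    "kms:PutKeyPolicy",
                    "kms:ScheduleKeyDeletion",
                ],
                "Resource": "*",
            },
            {
                # The role used for field-level encryption may use the key.
                "Sid": "AppRoleUsesKey",
                "Effect": "Allow",
                "Principal": {"AWS": app_arn},
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": "*",
            },
            {
                # Explicit deny: nobody except the app role can decrypt.
                "Sid": "DenyDecryptToEveryoneElse",
                "Effect": "Deny",
                "Principal": {"AWS": "*"},
                "Action": "kms:Decrypt",
                "Resource": "*",
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": app_arn}},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and role names, for illustration only.
    print(json.dumps(field_key_policy("111122223333", "app-db-role", "key-admin"), indent=2))
```

One caveat with this shape: anyone allowed to call `kms:PutKeyPolicy` can rewrite the policy, so changes to it need to be restricted and monitored separately.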
I would ask them what their ideal setup is and then compare feasibility. There's probably a lot of indirections/hoops you could jump through but if your security concerns are being driven by your customers you should probably ask them. If it is the case that you need to access their unencrypted data then at one point or another you're going to have to do it, the question is which possible way would your customers feel happiest about? On-premises contract, storing encrypted + customer-specific decrypt keys with a managed auth service, etc etc
Sounds overly complicated. Use at-work encryption (i.e. encrypt it in the database), on top of encryption in-transit and at-rest, hosted/managed by a reputable database vendor. If that won't fly, then I agree with the (enterprise) self-hosted offering another commenter mentioned.
The problem is that I cannot do that. I need to run code on the data which means I can access the data theoretically any time and thus my client is super uncomfortable with that considering I need to access their code base.
Are they uncomfortable with you accessing their data, or are they uncomfortable with you storing their data unencrypted, risking their IP in case of a breach? Two different things.
The former means they aren't a fit for SaaS (i.e. offer self-hosting), and the latter means you can use at-work encryption, only decrypting the data to process it.
Without more info on what you're actually building, I can't really be of more help here.
It's the latter. It's pretty much an agent for their GitHub repo. The agent needs access to their code and keeps some kind of knowledge that it generated in a tree database. Wouldn't it be considered a red flag that I can access their data whenever I want? If I used at-work encryption, that just means that I have the ability to access the data whenever I want. However, if they did some sort of self-hosting, then I could only access the data temporarily via APIs, and only when they want.
It boggles the mind that there are now software developers who apparently have no concept of building software for people to install on their own computer(s).
This is an option but it's too much of a hassle. You can't store the company data all on one single computer. An alternative would be to create some sort of intranet inside the company only for this data. This would also mean that we can only process the data if the computer was open.
> This would also mean that we can only process the data if the computer was open.
The whole point of keeping the data on computers a client controls is that they have control over it - if you're just going to pull it out somewhere else for fuck knows what purposes, it defeats the purpose of them storing the data.
The solution is to have software that is installed on computer(s) they control. Whatever "process" you need to run should run on their computer(s), under their control.
If that is "too much of a hassle", it sounds like they'll be better served by a vendor that actually understands their concerns and requirements.
Why not run the database in a docker container, one for each client? They could even run on the same machine.
That makes sense, I could add some code in the container that can communicate with private APIs in my servers. Is this standard practice or just an adhoc solution?
It’s the scenario Kubernetes was created for.
Thank you!
why not process their data in the frontend using WASM?
This crossed my mind too (as a why hasn't this been built?).
For now I think a few meh options:
You let the customer BYO encryption keys. You need tech savvy customers.
You offer to install on the customer's cloud. Also needs tech savvy customers.
For some solutions, a web front end that talks to Dropbox etc. may suffice. You just serve up static HTML.
Desktop app.
Web app with download/upload like Excalidraw.
The main problem is that I need to process their private data frequently. This means end-to-end encryption is not really an option.