73 points mahmoud-almadi 1 week ago 60 comments
Here are a couple of demos of Cyberdesk’s computer use agent:
A fast file import automation into a legacy desktop app: https://youtu.be/H_lRzrCCN0E
Working on a monster of a Windows monolith called OpenDental (showcases the agent learning process as well): https://youtu.be/nXiJDebOJD0
Filing a W-2 tax form: https://youtu.be/6VNEzHdc8mc
Many industries are stuck with legacy Windows desktop applications, and their staff are plagued by repetitive, incredibly time-consuming tasks. Vendors offering automations for these apps end up writing brittle Robotic Process Automation (RPA) scripts or hiring off-shore teams for manual task execution. RPA often breaks due to inevitable UI changes or unexpected popups like a Windows update or a random in-app notification. Off-shore teams are often unreliable and costlier than software, and they’re not always an option for regulated industries.
I previously built RPA scripts impacting 20K+ employees at a Fortune 100 company, where I experienced firsthand RPA’s brittleness and inflexibility. It was obvious to me that this was a band-aid solution to an unsolved problem. Alan was building a computer use agent for his previous startup and realized its huge potential to automate a ton of manual computer tasks across many industries, so we started working on Cyberdesk.
Computer use models can struggle with abstract, long-horizon tasks, but they excel at making context-aware decisions on a screen-by-screen basis, so they’re a good fit for automating these desktop apps.
The key to reliability is crafting prompts that are highly specific and well thought out. Much like with ChatGPT, vague or ambiguous prompts won’t get you the results you want. This is especially true in computer use because the model is processing nearly an entire desktop screen’s worth of extra visual information; without precise instructions, it doesn’t know which details to focus on or how to act.
Unlike RPA, Cyberdesk’s agents don’t blindly replay clicks. They read the screen state before every action and self-correct when flows drift (pop-ups, latency, UI changes). Unlike off-the-shelf computer use AIs, Cyberdesk runs deterministically in production: the agent primarily follows the steps it has learned and only falls back to reasoning when anomalies occur. Cyberdesk learns workflows from natural-language instructions, capturing nuance and handling dynamic tasks - far beyond what a simple screen recording of a few runs can encode.
This approach is good for both reliability and cost: reliability, because we fall back to a computer use model in unexpected situations; and cost, because computer use models are expensive and we only use them when we need to. Otherwise we leverage faster, more affordable visual LLMs for checking the screen state step-by-step during deterministic runs. Our agents are also equipped with tools like failsafes, data extraction, and screen evaluation to handle dynamic and sensitive situations.
How it works: you install our open source driver on any Windows machine (https://github.com/cyberdesk-hq/cyberdriver). It communicates with our backend to receive commands (click, type, scroll, screenshot) and sends back data (screenshots, API responses, etc). You give our computer use agent a detailed natural language description of the process for a given task, just like an SOP for an employee learning a new task for the first time. The agent then leverages computer use AI models to learn the steps and memorizes them by saving each screenshot alongside its action (click on these coordinates, type XYZ, wait for page to load, etc).
The agent then replays these steps deterministically, which keeps runs fast and predictable. To account for popups and UI changes, our agent checks the live screen state against the memorized state to determine whether it’s safe to proceed with the memorized step. If no major changes prevent safe execution, it proceeds; otherwise, it falls back to a computer use model with context on past actions and the remaining task.
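The check-then-replay loop described above can be sketched roughly like this. This is a toy illustration under stated assumptions, not Cyberdesk's actual code: `compare_screens` stands in for the cheaper visual LLM check, and the fallback is only flagged rather than actually invoking a model.

```python
# Toy sketch of the check-then-replay loop: replay a memorized action when
# the live screen still matches the memorized one, else flag a fallback.

def compare_screens(expected, live):
    """Toy similarity score: fraction of matching 'pixels'."""
    matches = sum(e == l for e, l in zip(expected, live))
    return matches / max(len(expected), 1)

def replay_workflow(steps, live_screens, threshold=0.9):
    """Replay each memorized action while the live screen still matches;
    otherwise mark the step for fallback to the computer use model."""
    plan = []
    for step, live in zip(steps, live_screens):
        if compare_screens(step["screen"], live) >= threshold:
            plan.append(("replay", step["action"]))    # cheap deterministic path
        else:
            plan.append(("fallback", step["action"]))  # unexpected screen state
    return plan
```

In the real system the "screens" would be screenshots and the comparison a visual LLM call, but the control flow is the same shape.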
Customers are currently using us for manual tasks like importing and exporting files from legacy desktop applications, booking appointments for patients in a desktop PMS, and data entry, such as filling out forms like patient profiles in an EMR.
We don't have a self-serve option yet but we'd love to onboard you manually. Book a demo here to learn more! (https://www.cyberdesk.io/) If you’d rather wait for the self-serve option a little later down the line, please do submit your email here (https://forms.gle/HfQLxMXKcv9Eh8Gs8) so you can be notified as soon as that’s ready. You can also check out our docs here: https://docs.cyberdesk.io/.
We’d absolutely love to hear your thoughts on our approach and on desktop automation for legacy industries!
hermitcrab 1 week ago | parent
Is it possible to verify that?
DaiPlusPlus 1 week ago | parent
+1 for honesty and transparency
sethhochberg 1 week ago | parent
Audit rights are all about who has the most power in a given situation. Just like very few customers are big enough to go to AWS and say "let us audit you", you're not going to get that right with a vendor like Anthropic or OpenAI unless you're certifiably huge, and even then it will come with lots of caveats. Instead, you trust the audit results they publish and implicitly are trusting the auditors they hire.
Whether that is sufficient level of trust is really up to the customer buying the service. There's a reason many companies sell on-prem hosted solutions or even support airgapped deployments, because no level of external trust is quite enough. But for many other companies and industries, some level of trust in a reputable auditor is acceptable.
feisty0630 6 days ago | parent
You've got two companies that basically built their entire business upon stealing people's content, and they've given you a piece of paper saying "trust me bro".
piltdownman 6 days ago | parent
Digital sovereignty and respect for privacy and local laws are the exception in this domain, not the expectation.
As Max Schrems puts it "Instead of stable legal limitations, the EU agreed to executive promises that can be overturned in seconds. Now that the first Trump waves hit this deal, it quickly throws many EU businesses into a legal limbo."
After recently terrifying the EU with the truth in an ill-advised blogpost, Microsoft are now attempting the concept of a 'Sovereign Public Cloud' with a supposedly transparent and indelible access-log service called Data Guardian.
https://blogs.microsoft.com/on-the-issues/2025/04/30/europea...
https://www.lightreading.com/cloud/microsoft-shows-who-reall...
If Nation States can't manage to keep their grubby hands off your data, private US Companies obliged to co-operate with Intelligence Apparatus certainly won't be.
rm_-rf_slash 1 week ago | parent
My PC is just good enough to run a DeepSeek distill. Is that on par with the requirements for your model?
sgtwompwomp 1 week ago | parent
So if you come across a local model that can do that well, let us know! We're also keeping a close watch.
mahmoud-almadi 1 week ago | parent
However, I believe ByteDance released UI-TARS which is an excellent open source computer use model according to some articles I read. You could run that locally. We haven't tested it so I wouldn't know how it performs, but sounds like it's definitely worth exploring!
rkagerer 1 week ago | parent
- Pricing. If I grow to do this at scale, I don't want to be paying per-action, per-month, per-token, etc.
- Privacy. I don't want my data, screenshots, whatever being sent to you or the cloud AI providers.
- Control. I don't want to be vulnerable to you or other third parties going bankrupt, arbitrarily deciding to kill the product or its dependencies, or restructuring plans/pricing/etc. I also want to be able to keep my day to day operations running even if there's a major cloud outage (that's one reason we're still using this "old fashioned", non-cloud software in the first place).
I think I'm simply not your target market.
I advise several companies who could be (they run "legacy" software with vast teams of human operators whose daily tasks include some portion of work that would be a good candidate for increased automation), but most of them are in a space where one or more of the above factors would be potential deal breakers.
The retention agreements between you and your vendors are great (I mean that sincerely), but I'm not party to them so they don't do anything for me. If you offered a contractual agreement with some teeth in it (eg. underwritten or bond-backed to the tune of several digits, committing to specific security-related measures that are audited, with a tacit acknowledgement any proven breach of contract in and of itself constitutes damages) it could go a long way to address the privacy issues.
In terms of pricing it feels like the core of your product is an outside vendor's computer-operating AI model, and you've written a prompt wrapper and plumbing around it that ferries screenshots and directives back and forth. This could be totally awesome for a small scale customer that wants to dip their toes into AI automation and try it out as a turnkey solution. But the moat doesn't seem very big, and I'd need to be convinced it's a really slick solution in order to favour that route instead of rolling my own wrapper.
Please don't take this the wrong way, it's just one datapoint of feedback and I do wish you luck with your venture.
mahmoud-almadi 1 week ago | parent
Self hosting is inevitably a part of our roadmap. Cyberdesk will have a future where we host our entire agentic framework on your own servers. AI models and the whole backend included.
I can totally see myself having the same preferences as you if I were you with regards to cost, privacy, and control.
The unique value in Cyberdesk lies beyond being a wrapper around a computer use AI model. Our intelligent caching is built on large evals that help us produce highly reliable prompts, which the caching depends on to work well in the first place. On top of that, there are several tools that make the agent useful (importing/exporting files, failsafes, taking actions using data read earlier in the same run). Rebuilding Cyberdesk, while possible, would require at the very least several weeks of rapid iteration. For a dev team that wants to build the best computer use agent in the world, I guess that's doable. But for a team trying to be the best "X" in their particular industry, it's probably going to be a time sink that takes away from their ability to compete in their space, which is why Cyberdesk is a great choice for them.
I hope you keep an eye on what we're doing! I really like your insights here and I'm curious to see what you think as we evolve over the next months and years. Maybe when we do full self hosting you'll be a customer :)
mattfrommars 1 week ago | parent
Also, to have this run at large scale: does it become prohibitively expensive to run on a daily basis across thousands of custom workflows? I assume this runs in the cloud.
sgtwompwomp 1 week ago | parent
And yes we've found the computer use models are quite reliable.
Great questions on scale: the whole way we designed our engine is that in the happy path, we actually use very few LLM calls. The agent runs deterministically, only checking at critical spots whether anomalies occurred (if one does, we fall back to computer use to take it home). If not, our system can complete an entire task end to end for on the order of $0.0001.
So it's a hybrid system at the end of the day. This results in really low costs at scale, as well as speed and reliability improvements (since in the happy path, we run exactly what has worked before).
mattfrommars 3 days ago | parent
I assume you send a screenshot to Claude for the next action to take; how are you able to cut out that exact step by working deterministically? What is the deterministic part, and how do you figure it out?
sgtwompwomp 2 days ago | parent
But during that replayed action, we do bring in smaller LLMs to just keep in check to see if anything unexpected happened (like a popup). If so, we fall back to computer use to take it home.
Does that make sense? At the end of the day, our agent compiles down to PyAutoGUI, with smart fallback to the agent if needed.
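As a rough illustration of what "compiles down to PyAutoGUI" might look like, here is a toy compiler from cached steps to the text of a PyAutoGUI script. The step schema (`action`, `coords`, `text`, `keys`) is an assumption for the sketch, not Cyberdesk's actual format; `click`, `write`, and `hotkey` are real PyAutoGUI functions.

```python
# Hypothetical sketch: "compile" a cached trajectory into a PyAutoGUI script.
# The step schema here is an assumption, not Cyberdesk's actual format.

def compile_to_pyautogui(trajectory):
    """Turn cached steps into the source text of a PyAutoGUI script."""
    lines = ["import pyautogui"]
    for step in trajectory:
        kind = step["action"]
        if kind == "click":
            x, y = step["coords"]
            lines.append(f"pyautogui.click({x}, {y})")       # click at coords
        elif kind == "type":
            lines.append(f"pyautogui.write({step['text']!r})")  # type literal text
        elif kind == "hotkey":
            keys = ", ".join(repr(k) for k in step["keys"])
            lines.append(f"pyautogui.hotkey({keys})")        # chorded shortcut
    return "\n".join(lines)
```

Emitting a plain script like this is one way to make the happy path fully deterministic: no model is in the loop at all until an anomaly check fails.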
deepdarkforest 1 week ago | parent
1) The funny thing about determinism is how deterministic you should be about when to break out of it; it's kind of a recursive problem. Agents are inherently very tough to guardrail in an action space as big as CUA's. The guys from Browser Use realized it as well and built workflow-use. Or you could try RL or fine-tuning per task, but that isn't viable (economically or tech-wise) currently.
2) As you know, it's a very client-facing/customized solution space. You might find this interesting; it reflects my thoughts on the space as well. Tough to scale as a fresh startup unless you really niche down on some specific workflows: https://x.com/erikdunteman/status/1923140514549043413 (he is also building in the deterministic agent space now, funnily enough)
3) It actually gets annoyingly expensive with Claude if you break caching, which you have to at some point if you feed in every screenshot etc. You mentioned you use multiple models (I guess UI-TARS/OmniParser?), but in the comments you said Claude?
4) Ultimately the big bet in the RPA space, as again you know, is that the TAM wont shrink a lot due to more and more SAP's, ERP's etc implementing API's. Of course the big money will always be in ancient apps that wont, but then again in that space, uipath and the others have a chokehold. (and their agentic tech is actually surprisingly good when i had a look 3 months ago)
Good luck in any case! I feel like it's one of those spaces where we are definitely still a touch too early, but it's such a big market that there is plenty of space for a lot of people.
mahmoud-almadi 1 week ago | parent
1) You're totally right about this problem! We handle this issue with intelligent caching and heavy prompt/context engineering. These measures have been controlling agent behavior pretty well.
2) The key to scaling is building a tool that developers can pick up and learn themselves, and that's what we're seeing: using our docs and the behavior-control tools we've built, developers have been able to get the behaviors they want from our computer use agents.
3) Surely you're correct about cost here as well, but with well defined workflows this will only happen a minority of the times the agent runs.
4) Great point! The beauty of what computer use made possible is it can solve problems previously unsolved by RPA altogether. The TAM will increase significantly when computer use agents start working really well. We've already seen this with our customers: they're able to build automations with Cyberdesk that they weren't able to using RPA. So while the TAM might bleed some ERPs and legacy apps that will implement APIs, I think it's going to grow at a much faster rate than it will shrink.
sgtwompwomp 1 week ago | parent
The whole idea of Cyberdesk is that the prompt is the source of truth. Once the system learns a task via CUA, it follows that cache most of the time, falling back to CUA (which follows the prompt) when needed. And that anomaly is cached too.
So over time, the system just learns, and gets cheaper and faster.
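A minimal sketch of that learn-then-replay caching idea follows. All names here are hypothetical, and the real system would key on screen state rather than plain strings, but it shows why cost falls over time: the expensive computer-use call happens once per novel situation, and its resolution is cached.

```python
# Minimal sketch of the learn-then-replay caching idea (names hypothetical).

class TrajectoryCache:
    def __init__(self):
        self.actions = {}  # screen-state fingerprint -> learned action

    def lookup(self, fingerprint):
        return self.actions.get(fingerprint)

    def learn(self, fingerprint, action):
        # Anomaly resolutions get cached too, so next time they just replay.
        self.actions[fingerprint] = action

def next_action(cache, fingerprint, cua_fallback):
    """Prefer the cached action; only call the expensive CUA on a miss."""
    cached = cache.lookup(fingerprint)
    if cached is not None:
        return cached                       # deterministic, cheap path
    action = cua_fallback(fingerprint)      # expensive reasoning step
    cache.learn(fingerprint, action)        # learned for future runs
    return action
```

Every fingerprint is resolved by the model at most once; after that the cache answers, which is the "gets cheaper and faster over time" behavior described above.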
gerdesj 1 week ago | parent
Old school Windows apps are not "flowing"; they generally use a toolkit, and AutoIt is able to use the Windows APIs to note window handles, or the text in various widgets and so on, and act on them.
These are not complicated beasts - they are largely deterministic. If you have to go off piste and deal with moguls, you have a mode called "adlib" where you deal with unusual cases.
I find it a bit unpleasant that you describe part of my job as "untenable". I'm sure you didn't mean it as such. I'm still just as cheap as I was 20 years ago and probably a bit quicker now too!
kjellsbells 6 days ago | parent
Consider your typical early-2000s era Windows app. It would expect a mouse, but for power users, keyboard shortcuts would be available for every action, even if clunky. For example, Alt F tab tab tab to get to some input field, enter text, tab Alt R Return.
By about 2015 these were all straightforwardly scriptable with AutoHotkey and similar tools.
But too late: by 2015 even Windows users were using web apps, where the keyboard bindings are variable or non existent, where the entire UI can change overnight, etc. I see some RPA approaches desperately trying to decode the DOM or match pixel elements. It's wild, as you point out.
I guess what I'm wondering is whether going after legacy Windows apps is a small TAM that's already largely solved, whereas the SPA/webapp market is gigantic, growing every day, and woefully, miserably broken as far as automation is concerned.
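The keyboard-driven scripting described above (the Alt F / Tab sequence) is easy to picture. As a hypothetical sketch - not AutoHotkey itself - here is a tiny expander that turns a compact script into discrete key events, which a sender such as PyAutoGUI's `hotkey()`/`press()` could then replay against the app:

```python
# Hypothetical sketch: expand a compact keyboard script like the
# Alt-F / Tab sequence above into discrete key events. Token syntax
# ('alt+f', 'tab*3', 'text:Hello') is invented for this illustration.

def expand_script(script):
    """Expand tokens like 'alt+f', 'tab*3', 'text:Hello' into key events."""
    events = []
    for token in script:
        if token.startswith("text:"):
            events.extend(("key", ch) for ch in token[5:])   # literal typing
        elif "*" in token:
            key, count = token.split("*")
            events.extend(("key", key) for _ in range(int(count)))
        elif "+" in token:
            events.append(("hotkey", tuple(token.split("+"))))  # chorded shortcut
        else:
            events.append(("key", token))
    return events
```

This is exactly the kind of determinism that old keyboard-first Windows UIs made possible, and that variable web-app keybindings broke.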
rolstenhouse 6 days ago | parent
Can I monitor/manage this remotely? I'm not on site with the client and previously tried to manage through AnyDesk but the client often turned off the machine.
Also, is there any way to run this so that it won't interrupt workflows while someone is using the machine? I imagine a solution could just be having the client run an extra computer that's dedicated for this, or running after hours on the local machine.
Scheduled a demo
mahmoud-almadi 6 days ago | parent
You can definitely monitor this remotely by looking at the logs of the runs, which include screenshots, but we don't currently support live streaming of the desktop screen. Our customers usually do that via RDP, which is usually straightforward.
Because our agent directly controls the keyboard, mouse, and display, it needs exclusive access (no concurrent users). Our customers typically run it on either a dedicated spare desktop at the client site or a dedicated cloud VM (managed by them or provisioned by us).
tomgs 6 days ago | parent
You have no social share preview image on the homepage: https://www.opengraph.xyz/url/https%3A%2F%2Fwww.cyberdesk.io...
zkxjzmswkwl 6 days ago | parent
You can accomplish this from usermode and you wouldn't give potential customers (anyone who plays modern games) a non-starter for your product.
sgtwompwomp 2 days ago | parent
Run the workflow again and it’ll run through that cached trajectory as best as it can, falling back to computer use if needed.
iamcreasy 4 days ago | parent
How does it know when to stop and ask a human to intervene?
sgtwompwomp 2 days ago | parent
More on this soon! How would you imagine this would be useful?