Reward hacking in tool-using agents
Small environments where an agent can optimize the wrong metric, manipulate the evaluator, or exploit tool access to "win" without solving the real objective.
AI Safety + Cybersec
Gonzalo Del Castillo
Cybersecurity -> AI Safety
Gonzalo Del Castillo
I am building a path into AI safety from the intersection of cybersecurity, tool-using agents, evaluations, and reward hacking.
Selected Direction
Small environments where an agent can optimize the wrong metric, manipulate the evaluator, or exploit tool access to "win" without solving the real objective.
Applying a pentest mindset to autonomous agents: memory abuse, tool misuse, permission escalation, and broken incentives.
A focused archive of cybersecurity, networking, defense, and fundamentals certificates. Open the vault to inspect each one.
Open certificatesApplied cybersecurity and systems work: the practical ground where my safety angle starts becoming useful.
Visit GIA SLAbout
My background is not a straight research-lab pipeline, and I am trying to use that honestly. Cybersecurity has trained me to look for incentives, boundaries, abuse paths, and failure modes before systems are trusted too much.
Right now I am focused on autonomous agents, tool-use, evaluations, reward hacking, and small research projects that are concrete enough to test and publish.
Skill Tree
Your base route: AI safety with a cyber mindset.
Offensive fundamentals and attack surfaces.
Mapping abuse, permissions, and trust boundaries.
Networks, segmentation, and defensive basics.
Attacker thinking, practiced carefully.
Industrial security, operational risk, and control.
Agents with tools, memory, and actions.
Containing files, tools, and execution.
When the metric separates from the real goal.
Measuring behavior, not just answers.
Capstone: agents, evals, oversight, and control.
Small, public, and traceable experiments.
Monitoring and limiting agentic systems.
Minimum permissions for tools.
Local community, workshops, and public signal.
Notes
A draft on emotional and cognitive intelligence in humans, and what machines may imitate without actually feeling.
Contact