AI Safety + Cybersec | Gonzalo Del Castillo

Cybersecurity -> AI Safety

Gonzalo Del Castillo

AI safety, cybersecurity, autonomous agents, and systems that should probably be watched more carefully.

I am building a path into AI safety from the intersection of cybersecurity, tool-using agents, evaluations, and reward hacking.

Selected Directions

Work and directions I want to turn into a real path.

Research direction - 2026

Reward hacking in tool-using agents

Small environments where an agent can optimize the wrong metric, manipulate the evaluator, or exploit tool access to "win" without solving the real objective.

  • Agent safety
  • Evaluations
  • Reward tampering

Positioning

Threat modeling for AI agents

Applying a pentest mindset to autonomous agents: memory abuse, tool misuse, permission escalation, and broken incentives.


Credentials

Cybersecurity and AI certificates

A focused archive of cybersecurity, networking, defense, and fundamentals certificates. Open the vault to inspect each one.


Applied work

GIA SL

Automation, AI agents, and RAG systems: practical work where cyber instincts meet real AI deployment.


About

I am building my way into AI safety from cybersecurity.

My background is not a straight research-lab pipeline, and I am trying to use that honestly. Cybersecurity has trained me to look for incentives, boundaries, abuse paths, and failure modes before systems are trusted too much.

Right now I am focused on autonomous agents, tool use, evaluations, reward hacking, and small research projects that are concrete enough to test and publish.

  • Cyber Foundations
  • AI Agents
  • Safety Basics
  • Research Practice
  • Tool Security
  • Red Teaming

Skill Tree

Progress branches: from cybersecurity to agent safety.

Core Build

Your base route: AI safety with a cyber mindset.

Cyber Foundations

Offensive fundamentals and attack surfaces.

Threat Model

Mapping abuse, permissions, and trust boundaries.

Network Security

Networks, segmentation, and defensive basics.

Red Teaming

Attacker thinking, practiced carefully.

OT Security

Industrial security, operational risk, and control.

AI Agents

Agents with tools, memory, and actions.

Sandbox Control

Containing files, tools, and execution.

Reward Hacking

When the metric diverges from the real goal.

Evaluations

Measuring behavior, not just answers.

AI Safety

Capstone: agents, evals, oversight, and control.

Independent Research

Small, public, and traceable experiments.

Oversight

Monitoring and limiting agentic systems.

Tool Boundaries

Minimum permissions for tools.

Spain Community

Local community, workshops, and public signal.

Notes

A place to publish experiments, lessons, and questions.

In progress

Can Intelligence Exist Without Emotion?

A draft on emotional and cognitive intelligence in humans, and what machines may imitate without actually feeling.

Contact

Open to collaborating, learning, and building community.

Let's connect on AI safety, cybersecurity, research ideas, or community building.