<?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
  <!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.18 (Ruby 2.6.10) -->


<!DOCTYPE rfc  [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">

<!ENTITY SELF "[RFC-XXXX]">
]>


<rfc ipr="trust200902" docName="draft-jimenez-tbd-robotstxt-update-00" category="info" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true">
  <front>
    <title abbrev="robots-proposal">Robots.txt update proposal</title>

    <author initials="J." surname="Jimenez" fullname="Jaime Jimenez">
      <organization>Ericsson</organization>
      <address>
        <email>jaime@iki.fi</email>
      </address>
    </author>

    <date year="2024" month="November" day="06"/>

    <area>Applications</area>
    <workgroup>ai-control</workgroup>
    

    <abstract>


<?line 28?>

<t>This document proposes updates to the robots.txt standard to accommodate AI-specific crawlers, introducing a syntax for user-agent identification and policy differentiation. It aims to enhance the management of web content access by AI systems, distinguishing between training and inference activities.</t>



    </abstract>

    <note title="About This Document" removeInRFC="true">
      <t>
        Status information for this document may be found at <eref target="https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/"/>.
      </t>
      <t>
        Discussion of this document takes place on the
        ai-control Working Group mailing list (<eref target="mailto:ai-control@ietf.org"/>),
        which is archived at <eref target="https://mailarchive.ietf.org/arch/browse/ai-control/"/>.
        Subscribe at <eref target="https://www.ietf.org/mailman/listinfo/ai-control/"/>.
      </t>
    </note>


  </front>

  <middle>


<?line 32?>

<section anchor="introduction"><name>Introduction</name>

<t>The current robots.txt standard inadequately filters AI crawlers due to its reliance on a "user-agent name" based approach and limited syntax. It is difficult to differentiate based on the intended use of data, such as storage, indexing, training, or inference.</t>

<t>We submitted the following proposal to the AI-Control WS: <eref target="https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf"/>.
Based on further discussion, the following text may describe a solution to the
problems described in the WS.</t>

<section anchor="terminology"><name>Terminology</name>

<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.
<?line -6?></t>

<t>This specification makes use of the following terminology:</t>

<dl newline="true">
  <dt>Crawler:</dt>
  <dd>
    <t>A traditional web crawler. This also includes crawlers operated by AI companies that do not use the gathered content to train any model, LLM or otherwise, because their purpose is purely real-time data integration for inference.</t>
  </dd>
  <dt>AI Crawler:</dt>
  <dd>
    <t>A specialized type of crawler employed by AI companies, which utilizes the gathered content exclusively for training purposes rather than for inference.</t>
  </dd>
</dl>

</section>
<section anchor="user-agent-update"><name>User-Agent Update</name>

<t>Crawlers are normally identified by the HTTP User-Agent request header, the source IP address of the request, or the reverse DNS hostname of that address.</t>
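<t>As an illustration only, the following Python sketch shows one common way to combine the source IP address with its reverse DNS hostname: a forward-confirmed reverse DNS check. This technique is not defined by this draft. The function name <spanx style="verb">verify_crawler_ip</spanx>, the suffix <spanx style="verb">.crawler.example</spanx>, and the stub resolvers are hypothetical and exist only for this example.</t>

<figure><sourcecode type="python"><![CDATA[
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler address.

    allowed_suffixes are lowercase and should start with "." so that
    "evilcrawler.example" does not pass for ".crawler.example".
    """
    try:
        # Reverse lookup: address -> hostname (first tuple element).
        hostname = reverse_lookup(ip)[0].lower()
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # Forward lookup: the hostname must map back to the same address,
        # otherwise whoever controls the PTR record could spoof the name.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False


# Stub resolvers (hypothetical data) so the example runs without network.
def stub_reverse(ip):
    return ("bot-1.crawler.example", [], [ip])


def stub_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


print(verify_crawler_ip("192.0.2.10", [".crawler.example"],
                        stub_reverse, stub_forward))  # True
print(verify_crawler_ip("192.0.2.11", [".crawler.example"],
                        stub_reverse, stub_forward))  # False
]]></sourcecode></figure>

<t>Such checks complement the User-Agent value, which any client can set freely and therefore cannot be trusted on its own.</t>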

<t>A draft that defines a syntax for <spanx style="verb">user-agents</spanx> would be necessary. The syntax has to be extensible, so that not only AI crawlers but potentially other crawlers can use it. It should not be mandatory for clients to implement, as it should be backwards compatible.</t>

<t>An absolutely minimal syntax would be similar to what we see in the wild: most AI companies append the <spanx style="verb">-ai</spanx> characters to the end of the user-agent name to indicate that the crawler is used for ingesting the content into an AI system. For example:</t>

<figure><artwork><![CDATA[
  User-agent: company1-ai
  User-agent: company2-ai
]]></artwork></figure>

<t>Otherwise, we could reuse identifiers such as <eref target="https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml">URN namespaces</eref> (e.g., urn:rob:...), <eref target="https://datatracker.ietf.org/doc/html/draft-ietf-core-href-16">CRIs</eref>, or cryptographically derived identifiers. There are dozens of options at the IETF, so it is a matter of choosing the right one.</t>

<t>The <spanx style="verb">-ai</spanx> syntax would indicate that the crawler using it is interested in training.
In this draft, we treat inference as a separate process, akin to normal web crawling, and thus already covered.</t>
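<t>A minimal sketch of how a consumer of the <spanx style="verb">-ai</spanx> suffix convention might classify a user-agent name is given here in Python. The function name <spanx style="verb">is_ai_training_agent</spanx> and the case-insensitive comparison are assumptions made for this example; they are not defined by this draft.</t>

<figure><sourcecode type="python"><![CDATA[
AI_SUFFIX = "-ai"


def is_ai_training_agent(product_token: str) -> bool:
    """Return True if the token follows the "-ai" suffix convention."""
    token = product_token.strip().lower()
    # Require at least one character before the suffix so that a bare
    # "-ai" is not treated as a valid product token.
    return len(token) > len(AI_SUFFIX) and token.endswith(AI_SUFFIX)


print(is_ai_training_agent("company1-ai"))  # True
print(is_ai_training_agent("company1"))     # False
]]></sourcecode></figure>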

<t>This approach differs from draft-canel-robots-ai-control, as it does not require a new field in the robots.txt ABNF, such as the one shown below:</t>

<figure><artwork><![CDATA[
User-Agent-Purpose: EXAMPLE-PURPOSE-1
]]></artwork></figure>

</section>
<section anchor="robotstxt-update"><name>Robots.txt Update</name>

<t>The <eref target="https://datatracker.ietf.org/doc/html/rfc9309#name-formal-syntax">RFC 9309 ABNF</eref> should be updated to address the new User-agent syntax. If we continue with the <spanx style="verb">-ai</spanx> convention above, we could use regular expressions to apply different policies to AI crawlers. For example:</t>

<t><list style="symbols">
  <t>Disallow all AI-training</t>
</list></t>

<figure><artwork><![CDATA[
User-Agent: .*?-ai$
Disallow: /
]]></artwork></figure>

<t><list style="symbols">
  <t>Allow training on everything under /images, but disallow training on /maps, for all agents that do AI training.</t>
</list></t>

<figure><artwork><![CDATA[
User-Agent: .*?-ai$
Allow: /images
Disallow: /maps*
]]></artwork></figure>

<t><list style="symbols">
  <t>Allow /local for cohere-ai</t>
</list></t>

<figure><artwork><![CDATA[
User-Agent: cohere-ai
Allow: /local
]]></artwork></figure>
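<t>The following Python sketch illustrates how a crawler might evaluate such rules. It rests on several assumptions that this draft does not specify: User-Agent values other than <spanx style="verb">*</spanx> are treated as case-insensitive regular expressions matched against the whole user-agent name, rules from every matching group are merged, and the longest matching path wins, with Allow winning ties. The function names <spanx style="verb">parse_groups</spanx>, <spanx style="verb">path_regex</spanx>, and <spanx style="verb">is_allowed</spanx> are hypothetical.</t>

<figure><sourcecode type="python"><![CDATA[
import re

ROBOTS_TXT = """\
User-Agent: .*?-ai$
Allow: /images
Disallow: /maps*

User-Agent: cohere-ai
Allow: /local
"""


def parse_groups(text):
    """Parse robots.txt text into a list of (agent_patterns, rules)."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


def path_regex(pattern):
    """Translate a path pattern ("*" wildcard, trailing "$") to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(text, product_token, path):
    """Decide access, treating User-Agent values as regular expressions."""
    rules = []
    for agents, group_rules in parse_groups(text):
        for agent in agents:
            # Assumption: "*" matches every crawler; any other value
            # is a regex that must match the whole product token.
            if agent == "*" or re.fullmatch(agent, product_token,
                                            re.IGNORECASE):
                rules.extend(group_rules)
                break
    best = None  # (pattern length, is_allow); longest wins, Allow on ties
    for verb, pattern in rules:
        if pattern and path_regex(pattern).match(path):
            candidate = (len(pattern), verb == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


print(is_allowed(ROBOTS_TXT, "company1-ai", "/images/cat.png"))  # True
print(is_allowed(ROBOTS_TXT, "company1-ai", "/maps/berlin"))     # False
print(is_allowed(ROBOTS_TXT, "searchbot", "/maps/berlin"))       # True
]]></sourcecode></figure>

<t>Because <spanx style="verb">cohere-ai</spanx> also matches the regular expression of the first group, this sketch applies the rules of both groups to that crawler. A real specification would need to state whether the more specific group should take precedence instead.</t>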

<t>This proposal also differs from the new control rules <spanx style="verb">DisallowAITraining</spanx> and <spanx style="verb">AllowAITraining</spanx> proposed by <eref target="https://datatracker.ietf.org/doc/draft-canel-robots-ai-control/">draft-canel-robots-ai-control</eref>. From a semantic perspective, it is problematic to create purpose-specific rules, such as DisallowThisProperty and DisallowAnotherProperty, that have the same meaning and effect as the existing verbs Disallow and Allow.</t>

<t>In our proposal, the information about the agent's purpose is carried in the User-Agent itself, which makes it possible to filter out AI training agents using simple regular expressions and the existing semantics.</t>

</section>
</section>
<section numbered="no" anchor="acknowledgements"><name>Acknowledgements</name>

<t>The author would like to thank Jari Arkko for his review and feedback on short notice.</t>

</section>


  </middle>

  <back>



    <references title='Normative References' anchor="sec-normative-references">



<reference anchor="RFC2119">
  <front>
    <title>Key words for use in RFCs to Indicate Requirement Levels</title>
    <author fullname="S. Bradner" initials="S." surname="Bradner"/>
    <date month="March" year="1997"/>
    <abstract>
      <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
    </abstract>
  </front>
  <seriesInfo name="BCP" value="14"/>
  <seriesInfo name="RFC" value="2119"/>
  <seriesInfo name="DOI" value="10.17487/RFC2119"/>
</reference>

<reference anchor="RFC8174">
  <front>
    <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
    <author fullname="B. Leiba" initials="B." surname="Leiba"/>
    <date month="May" year="2017"/>
    <abstract>
      <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
    </abstract>
  </front>
  <seriesInfo name="BCP" value="14"/>
  <seriesInfo name="RFC" value="8174"/>
  <seriesInfo name="DOI" value="10.17487/RFC8174"/>
</reference>




    </references>





  </back>


</rfc>

