What is HTTP and how does it work?


The Hypertext Transfer Protocol (HTTP) is the invisible yet fundamental cornerstone of the modern internet. It is the set of rules that defines how web clients (such as browsers) and web servers exchange information. Understanding how HTTP works is critically important not only for web developers but also for anyone working with network technologies, including proxies and web scraping.

Fundamentals of HTTP: definition and role

HTTP (Hypertext Transfer Protocol) is an application layer protocol used for transmitting hypermedia documents, such as HTML. It was developed in the early 1990s by Tim Berners-Lee and has since become the foundation of communication on the World Wide Web.

Key characteristics:

  1. Client-Server Protocol: Interaction is always initiated by the client (such as a browser), which sends a request; the server processes it and returns a response.
  2. Stateless Protocol: Each request-response pair is processed independently of the previous ones. The server does not remember the client’s previous requests by default. This makes HTTP scalable, but requires additional mechanisms (like cookies) for managing user sessions.
  3. Uses TCP/IP: HTTP usually uses TCP (Transmission Control Protocol) to ensure reliable data delivery, most often over port 80 (for HTTP) or 443 (for HTTPS).

The request-response cycle

The operation of HTTP is built around a simple cycle:

  1. Connection Establishment: The client (your browser) establishes a TCP connection with the server.
  2. Sending the Request: The client sends an HTTP request.
  3. Request Processing: The server receives the request, processes it (e.g., finds the required file or runs a script).
  4. Sending the Response: The server sends an HTTP response with the requested data.
  5. Connection Closing: The connection may be closed, but in modern versions (HTTP/1.1 and later) it often remains open for subsequent requests (persistent connections). A raw-socket walkthrough of this cycle follows below.
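The entire cycle can be traced with a raw TCP socket. The following is a minimal sketch in Python, assuming plain HTTP on port 80 against the reserved example.com domain; real clients normally rely on an HTTP library instead:

```python
import socket

# 1. Establish a TCP connection to the server (plain HTTP, port 80).
with socket.create_connection(("example.com", 80), timeout=10) as sock:
    # 2. Send a minimal HTTP/1.1 request; "Connection: close" asks the
    #    server to close the connection after this response.
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # 3-4. The server processes the request and streams back the response.
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# 5. The connection is closed (forced here by "Connection: close").
#    Print the status line and headers, skipping the body.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("iso-8859-1"))
```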

Structure of HTTP messages

HTTP messages (both requests and responses) share the same overall structure: a start-line, headers, an empty line, and an optional body.

Structure of an HTTP request

| Part | Description | Example |
| --- | --- | --- |
| Start-line | Defines the action. | GET /index.html HTTP/1.1 |
| Headers | Provide metadata about the request, client, and body. | Host: proxyverity.com, User-Agent: Mozilla/5.0 |
| Empty line | Separates the headers from the body. | \r\n |
| Body | Contains data sent to the server (e.g., form data for a POST request). | { "username": "test" } |
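Put together, a complete request message on the wire might look like this (the path and JSON body are illustrative):

```
POST /api/login HTTP/1.1
Host: proxyverity.com
User-Agent: Mozilla/5.0
Content-Type: application/json
Content-Length: 22

{ "username": "test" }
```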

Structure of an HTTP response

| Part | Description | Example |
| --- | --- | --- |
| Status-line | Contains the protocol version and status code. | HTTP/1.1 200 OK |
| Headers | Provide metadata about the response, server, and body. | Content-Type: text/html, Date: Tue, 15 Oct 2024 |
| Empty line | Separates the headers from the body. | \r\n |
| Body | Contains the requested data (e.g., HTML code, an image, JSON). | <html>...</html> |
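A corresponding raw response might look like this (status line, headers, empty line, body; all values are illustrative):

```
HTTP/1.1 200 OK
Date: Tue, 15 Oct 2024 08:00:00 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 57

<html><body><h1>Hello from the server!</h1></body></html>
```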

HTTP methods (verbs)

The request method indicates the desired action to be performed for a given resource.

| Method | Purpose | Idempotence* | Safety** |
| --- | --- | --- | --- |
| GET | Requests data from a specified resource. | Yes | Yes |
| POST | Submits data to be processed (e.g., form data, a file upload). | No | No |
| PUT | Replaces all current representations of the target resource. | Yes | No |
| DELETE | Removes the specified resource. | Yes | No |
| HEAD | Requests the same headers as a GET request, but without the response body. | Yes | Yes |
| OPTIONS | Describes the communication options for the target resource. | Yes | Yes |

* Idempotence means that multiple executions of the request yield the same result as a single execution. GET, PUT, DELETE are idempotent. POST is not, as each POST request may create a new resource (e.g., a new database entry).

** Safety means that the request does not alter the state of the server.
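These methods can be compared with the Python requests library. A minimal sketch, assuming the public httpbin.org testing service is reachable:

```python
import requests

BASE = "https://httpbin.org"  # public HTTP testing service

# GET is safe and idempotent: it only reads data.
r = requests.get(f"{BASE}/get", params={"q": "proxy"})
print(r.status_code)          # 200

# POST is neither safe nor idempotent: repeating it may create
# a new resource each time.
r = requests.post(f"{BASE}/post", json={"username": "test"})
print(r.json()["json"])       # {'username': 'test'}

# HEAD returns the same headers as GET, but no body.
r = requests.head(f"{BASE}/get")
print(r.headers["Content-Type"], len(r.content))  # application/json 0
```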

HTTP status codes

A status code is a three-digit number that informs the client about the result of the request.

| Range | Meaning | Common Examples |
| --- | --- | --- |
| 1xx (Informational) | Request received, continuing process. | 100 Continue |
| 2xx (Success) | Request successfully received, understood, and accepted. | 200 OK, 201 Created |
| 3xx (Redirection) | Further action needs to be taken to complete the request. | 301 Moved Permanently, 302 Found |
| 4xx (Client Error) | The request contains bad syntax or cannot be fulfilled. | 403 Forbidden, 404 Not Found, 429 Too Many Requests |
| 5xx (Server Error) | The server failed to fulfill a valid request. | 500 Internal Server Error, 503 Service Unavailable |
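In code, status handling typically branches on these ranges. A minimal sketch with the Python requests library; the URL is hypothetical:

```python
import requests

resp = requests.get("https://example.com/some-page")

if resp.status_code == 200:
    print("Success:", len(resp.text), "bytes of HTML")
elif resp.status_code in (301, 302):
    # Note: requests follows redirects by default; you only see a
    # 3xx code here if allow_redirects=False was passed.
    print("Redirected to:", resp.headers.get("Location"))
elif resp.status_code == 429:
    # Too Many Requests: back off before retrying, honoring the
    # Retry-After header if the server sends one.
    wait = int(resp.headers.get("Retry-After", "5"))
    print(f"Rate limited, retry in {wait}s")
else:
    resp.raise_for_status()   # raises for remaining 4xx/5xx codes
```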

The role of HTTP in web scraping

In the context of web scraping, you act as the client, sending HTTP requests manually or via a script (e.g., using the Python requests library).

  1. Compliance: The protocol dictates which methods you can use (GET for data retrieval) and how to handle responses (e.g., recognizing code 403 as access denial).
  2. Header Management: In scraping, you frequently manipulate headers:
    • User-Agent: Changed to mimic a real browser to avoid being blocked.
    • Accept-Encoding: Manages data compression.
    • Referer: Indicates where the request originated from.
  3. Session Handling: Since HTTP is stateless, you must manage cookies (received via Set-Cookie and sent back via the Cookie header) to maintain logins or preserve settings, as shown in the sketch below.
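A minimal scraping-oriented sketch with the Python requests library; the login endpoint, credentials, and header values are all hypothetical:

```python
import requests

# A Session persists cookies and default headers across requests,
# compensating for HTTP's statelessness.
session = requests.Session()
session.headers.update({
    # Mimic a real browser (value is illustrative).
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://example.com/",
})

# Hypothetical login endpoint: any Set-Cookie in the response is
# stored by the session automatically.
session.post("https://example.com/login",
             data={"username": "test", "password": "secret"})

# Subsequent requests carry the stored cookies in the Cookie header.
profile = session.get("https://example.com/profile")
print(profile.status_code, session.cookies.get_dict())
```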

HTTP and proxy servers

A proxy server is an intermediary that sits between the client and the final server. In the context of HTTP, it plays a central role.

How proxies affect HTTP communication:

  1. Forward Proxy: When a client uses a proxy, the client’s HTTP request is first sent to the proxy server. The proxy analyzes the request, extracts the target address, and sends a new HTTP request to the final server on the client’s behalf. Upon receiving the response, the proxy forwards it back to the client.
    • Useful for: Anonymity (hides the client’s IP address), bypassing geo-restrictions, caching. See the sketch after this list.
  2. Reverse Proxy: Sits in front of a server (or a cluster of servers) and intercepts all incoming HTTP requests.
    • Useful for: Load balancing, security (shielding backend servers from direct attacks), TLS/SSL termination (handling encryption on behalf of the servers).
  3. Intermediary Headers: Proxies can add or modify headers, such as X-Forwarded-For to indicate the client’s original IP address, or Via to specify that the request passed through a proxy.
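A minimal sketch of routing a request through a forward proxy with the Python requests library; the proxy address (a reserved TEST-NET IP) is a placeholder:

```python
import requests

# Hypothetical forward proxy: requests sends the HTTP request to the
# proxy, which forwards it to the target server on our behalf.
proxies = {
    "http":  "http://203.0.113.10:8080",   # placeholder address
    "https": "http://203.0.113.10:8080",
}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
# httpbin.org/ip echoes the IP address the server saw: with a working
# proxy, it is the proxy's address, not yours.
print(resp.json())
```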

Evolution of HTTP: from 1.1 to 3

Over the years, the protocol has evolved to meet the demands for speed and efficiency:

  • HTTP/1.1 (1997): Added crucial features like persistent connections (Keep-Alive), which allowed multiple request-responses to be sent over a single TCP connection, significantly reducing latency.
  • HTTP/2 (2015): Introduced multiplexing, allowing multiple requests and responses to be interleaved simultaneously over a single TCP connection. This eliminated the application-level “Head-of-Line Blocking” problem of HTTP/1.1 and made web page loading significantly faster.
  • HTTP/3 (2022): Based on the QUIC transport protocol, which runs over UDP instead of TCP, providing even faster connection establishment (ideally 0-RTT) and eliminating blocking issues at the transport layer as well. This is the most modern and fastest version of the protocol; the sketch below shows how to check which version a server negotiates.
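Which version actually gets used can be checked in code. A minimal sketch with the third-party httpx library (installed with its http2 extra); the target URL is simply an example of an HTTP/2-capable server:

```python
import httpx  # third-party; HTTP/2 support requires: pip install httpx[http2]

# Negotiate HTTP/2 via TLS ALPN where the server supports it.
with httpx.Client(http2=True) as client:
    resp = client.get("https://www.google.com/")
    print(resp.http_version)  # e.g. "HTTP/2", or "HTTP/1.1" as a fallback
```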

Conclusion

HTTP is not just a way to transmit text. It is a meticulously designed protocol that ensures structured and reliable information exchange over the network. Its simple request-response model, system of methods, and status codes allow web applications to operate predictably. For those using proxies or engaging in scraping, understanding the internal mechanisms of HTTP is the key to building efficient and reliable systems.

Verified proxy expert

  • Roman Bulatov

    Roman Bulatov brings 15+ years of hands-on experience:

    - Web Infrastructure Expert: Built and scaled numerous data-heavy projects since 2005

    - Proxy Specialist: Designed and deployed a distributed proxy verification system with a daily throughput capacity of 120,000+ proxies across multiple performance and security metrics.

    - Security Focus: Creator of ProxyVerity's verification methodology

    - Open Internet Advocate: Helps journalists and researchers bypass censorship

    "I created ProxyVerity after years of frustration with unreliable proxies - now we do the hard work so you get working solutions."