If facing a captive portal, we may appear to get a TCP-level connection OK
but find that communication is silently dropped, leading to us timing out
in LRS_WAITING_SERVER_REPLY.
If so, we need to handle it as a connection failure in order to satisfy at
least Captive Portal detection.
This fixes the proxy rx flow by adding an lws_dsh helper to hide the
off-by-one in the "kind" array (kind 0 is reserved for tracking the
unallocated dsh blocks).
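As a rough illustration of the idea only (the struct and function names here are hypothetical and are not the real lws_dsh layout or API), a helper can keep the "+1" in one place so callers only ever deal in 0-based kinds:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: number of caller-visible kinds in this example */
#define EX_USER_KINDS 2

/* Hypothetical stand-in for per-kind bookkeeping, not struct lws_dsh */
struct ex_dsh {
	/* slot 0 tracks unallocated blocks, slots 1..EX_USER_KINDS are
	 * the caller-visible kinds */
	size_t bytes[EX_USER_KINDS + 1];
};

/* Map a caller-visible, 0-based kind to its internal slot, so callers
 * never index the reserved slot 0 by accident */
static size_t *
ex_dsh_kind_bytes(struct ex_dsh *dsh, int user_kind)
{
	assert(dsh != NULL);
	assert(user_kind >= 0 && user_kind < EX_USER_KINDS);

	return &dsh->bytes[user_kind + 1];
}
```

With this shape, code that wants "kind 0" writes ex_dsh_kind_bytes(dsh, 0) and lands on slot 1, leaving the unallocated-block tracking untouched.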
For testing, it adds a --blob option on minimal-secure-streams[-client]
which uses a streamtype "bulkproxflow" from here
https://warmcat.com/policy/minimal-proxy-v4.2-v2.json
    "bulkproxflow": {
        "endpoint": "warmcat.com",
        "port": 443,
        "protocol": "h1",
        "http_method": "GET",
        "http_url": "blob.bin",
        "proxy_buflen": 32768,
        "proxy_buflen_rxflow_on_above": 24576,
        "proxy_buflen_rxflow_off_below": 8192,
        "tls": true,
        "retry": "default",
        "tls_trust_store": "le_via_dst"
    }
This downloads a 51MB blob of random data with the SHA256sum
ed5720c16830810e5829dfb9b66c96b2e24efc4f93aa5e38c7ff4150d31cfbbf
The minimal-secure-streams --blob example client delays the download by
50ms every 10KiB it sees to force rx flow usage at the proxy.
It downloads the whole thing and checks the SHA256 is as expected.
Logs about rxflow status are available at LLL_INFO log level.
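The two rxflow thresholds in the policy suggest a simple hysteresis. The sketch below assumes "on_above" means rx flow control is applied once buffering exceeds that figure, and "off_below" means it is released once buffering drains under that figure; the struct and function are hypothetical and are not the proxy's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Thresholds copied from the "bulkproxflow" policy entry */
#define EX_RXFLOW_ON_ABOVE  24576
#define EX_RXFLOW_OFF_BELOW  8192

/* Hypothetical per-stream state at the proxy */
struct ex_flow {
	size_t buffered;  /* bytes held waiting for the slow client */
	int    rx_paused; /* nonzero while onward rx is flow-controlled */
};

/* Re-evaluate flow control after 'buffered' changed.
 * Returns 1 if the paused state changed, else 0. */
static int
ex_flow_update(struct ex_flow *f)
{
	assert(f != NULL);

	if (!f->rx_paused && f->buffered > EX_RXFLOW_ON_ABOVE) {
		f->rx_paused = 1; /* stop reading from the onward connection */
		return 1;
	}

	if (f->rx_paused && f->buffered < EX_RXFLOW_OFF_BELOW) {
		f->rx_paused = 0; /* drained enough, resume reading */
		return 1;
	}

	return 0;
}
```

The gap between the two thresholds avoids toggling flow control on every small read or write while the buffer level hovers around a single limit.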
This provides a way to get hold of LWS_WITH_CONMON telemetry from Secure
Streams. It works the same with direct onward connections or via the proxy.
You can mark streamtypes with a "perf": true policy attribute... this
causes the onward connections on those streamtypes to collect information
about the connection performance, and the unsorted DNS results.
Streams with that policy attribute receive extra data in their rx callback,
with the LWSSS_FLAG_PERF_JSON flag set on it, containing JSON that describes
the performance of the onward connection, taken from CONMON data. Streams
without the "perf" attribute set never receive this extra rx.
The received JSON is based on the CONMON struct info and looks like
{"peer":"46.105.127.147","dns_us":596,"sockconn_us":31382,"tls_us":28180,"txn_resp_us":23015,"dns":["2001:41d0:2:ee93::1","46.105.127.147"]}
A new minimal example minimal-secure-streams-perf is added that collects
this data on an HTTP GET from warmcat.com, and is built with a -client
version as well if LWS_WITH_SECURE_STREAMS_PROXY_API is set, that operates
via the ss proxy and produces the same result at the client.
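For user code that only wants one of the numeric fields from that JSON, something as small as the following may be enough. It is a sketch, not part of lws: the helper name is hypothetical, it only understands flat "key":number members like the ones shown above, and it assumes the caller has already copied the rx payload into a NUL-terminated buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Find "key":<number> in a flat JSON object and return the number,
 * or -1 if the key is absent. This is not a general JSON parser. */
static long
ex_perf_json_us(const char *json, const char *key)
{
	char pat[64];
	const char *p;
	size_t n;

	assert(json != NULL && key != NULL);

	n = strlen(key);
	if (n + 4 > sizeof(pat)) /* quote + key + quote + colon + NUL */
		return -1;

	/* build the search pattern "key": including quotes and colon */
	pat[0] = '"';
	memcpy(pat + 1, key, n);
	memcpy(pat + 1 + n, "\":", 3); /* copies the terminating NUL too */

	p = strstr(json, pat);
	if (!p)
		return -1;

	/* the number starts right after the pattern */
	return strtol(p + n + 3, NULL, 10);
}
```

For example, ex_perf_json_us(buf, "tls_us") picks out the TLS negotiation time in microseconds. String members such as "peer" and the "dns" array need a real JSON parser.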
This was commented out during the metrics patch for some reason...
commenting it out breaks UDS -> web serving proxying.
Uncomment it and see what the other problem is..
This provides a build option LWS_WITH_CONMON that lets user code recover
detailed connection stats on client connections with the LCCSCF_CONMON
flag.
In addition to latencies for dns, socket connection, tls and first protocol
response where possible, it also provides the user code an unfiltered list
of DNS responses that the client received, and the peer it actually
succeeded in connecting to.
Really, not having any logs makes it difficult to know what is
happening, but if that's your thing this will align debug and release
modes to just have ERR and USER if you give WITH_NO_LOGS.
It's perfectly possible we will have destroyed the wsi and report that
back in the return code. So let's not dumbly dereference the wsi to
make a log in the meantime.
Found with fault injection and valgrind.
In the case that we try ipv6 that isn't routable, we get a POLLHUP, which
marks the wsi as unusable (for writes, not pending reads); that's what
we want.
But in the case we go around and retry other dns results that are
routable, we have to clear the wsi unusable flag. Otherwise we will
connect and find that we can't write on the connection...
If the DNS lookup fails, we just sit out the remaining connect time.
This adapts it to reuse the wsi->sul_connect_timeout to schedule DNS lookup
retries until we're out of time.
Eventually we want to try other things as well, this is aligned with that.
Found with fault injection.
There are a few build options that are trying to keep and report
various statistics
- DETAILED_LATENCY
- SERVER_STATUS
- WITH_STATS
Remove all of those and establish a generic replacement, lws_metrics.
lws_metrics makes its stats available via an lws_system ops function
pointer that the user code can set.
OpenMetrics export is supported, for, e.g., Prometheus scraping.
For SMP case, it was desirable to have a netlink listener per pt so they
could deal with pt-level changes in the pt's local service thread. But
Linux restricts the process to just one netlink listener.
We worked around it by only listening on pt[0]; this aligns us a bit more
with the reality and moves to a single routing table in the context.
There's still more to do for SMP case locking.
Also prioritize LD_LIBRARY_PATH check for plugins first
Iterate through paths in LD_LIBRARY_PATH in order
Warn on failed plugins init but continue protocol init
On h2 server POST, there's a race to see if the POST body is going to be
received coalesced with the headers.
The problem is on h2, we can't action the stream http request or body until
the stream is writeable, since we may start issuing the response right away;
there's already DEFERRING_ACTION state to manage this. And indeed, the
coalesced, not-immediately-actionable POST body is buflisted properly.
However when we come to action the POST using buflisted data, we don't follow
the same pattern as dealing with the incoming data immediately.
This patch aligns the pattern when draining the buflist content, so that we
track expected rx_content_length and handle BODY_COMPLETION if we got to
the end of it, along with removing the wsi from the pt list of wsi with
pending buflists if we used it up.
If the server is very close in rtt to the client, the server
hangup may get processed before buffered rx.
Make sure we clear buffered rx before dealing with the HUP.
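The ordering matters more than the mechanics. A minimal sketch, with a hypothetical connection struct standing in for the wsi and its buflist:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a connection with pending rx */
struct ex_conn {
	size_t buffered_rx; /* bytes read from the socket, not yet delivered */
	size_t delivered;   /* bytes already handed to the protocol handler */
	int    closed;
};

/* Handle a hangup event: flush pending rx first, only then close */
static void
ex_on_hup(struct ex_conn *c)
{
	assert(c != NULL);

	if (c->buffered_rx) {
		/* deliver what the peer sent before it hung up */
		c->delivered += c->buffered_rx;
		c->buffered_rx = 0;
	}

	c->closed = 1;
}
```

If the close happened first, the tail of the response would be lost even though the peer sent it successfully.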
The various stream transitions for direct ss, SSPC, smd, and
different protocols are all handled in different code, let's
stop hoping for the best and add a state transition validation
function that is used everywhere we pass a state change to a
user callback, and knows what is valid for the user state()
callback to see next, given the last state it was shown.
Let's assert if lws manages to violate that so we can find
where the problem is and provide a stricter guarantee about
what user state handler will see, no matter if ss or sspc
or other cases.
To facilitate that, move the states to start from 1, where
0 indicates the state unset.
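One way to picture such a validation function is a table of allowed successors per state. The states and transitions below are hypothetical and much smaller than the real Secure Streams set; only the shape is the point, including 0 meaning "nothing shown to the user yet".

```c
#include <assert.h>

/* Hypothetical states; 0 means no state was shown to the user yet */
enum ex_state {
	EX_UNSET = 0,
	EX_CREATING,
	EX_CONNECTING,
	EX_CONNECTED,
	EX_DISCONNECTED,
	EX_DESTROYING,
	EX_STATE_COUNT
};

#define EX_B(s) (1u << (s))

/* bit n set in ex_valid_next[s] means state n may follow state s */
static const unsigned int ex_valid_next[EX_STATE_COUNT] = {
	[EX_UNSET]        = EX_B(EX_CREATING),
	[EX_CREATING]     = EX_B(EX_CONNECTING) | EX_B(EX_DESTROYING),
	[EX_CONNECTING]   = EX_B(EX_CONNECTED) | EX_B(EX_DISCONNECTED) |
			    EX_B(EX_DESTROYING),
	[EX_CONNECTED]    = EX_B(EX_DISCONNECTED) | EX_B(EX_DESTROYING),
	[EX_DISCONNECTED] = EX_B(EX_CONNECTING) | EX_B(EX_DESTROYING),
	[EX_DESTROYING]   = 0,
};

/* Returns nonzero if the user may be shown 'next' after 'prev' */
static int
ex_transition_ok(enum ex_state prev, enum ex_state next)
{
	if ((unsigned int)prev >= EX_STATE_COUNT ||
	    (unsigned int)next >= EX_STATE_COUNT)
		return 0;

	return !!(ex_valid_next[prev] & EX_B(next));
}

/* Call wherever a state is about to be passed to the user callback */
static void
ex_set_state(enum ex_state *prev, enum ex_state next)
{
	assert(ex_transition_ok(*prev, next));
	*prev = next;
}
```

Because every path funnels through the one check, an invalid sequence asserts at the point it is generated rather than surfacing later as confusing user-side behaviour.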
Let's allow the proxy to pass back what the policy says about
the size of dsh buffer the client side of this streamtype
should have.
Defer client-side dsh generation until we get the info back
from the proxy in the response to the initial packet. If
it's zero / unset in the policy, just go with 32KB.
This is a huge patch that should be a global NOP.
For unix type platforms it enables -Wconversion to issue warnings (-> error)
for all automatic casts that seem less than ideal but are normally concealed
by the toolchain.
This is things like passing an int to a size_t argument. Once enabled, I
went through all args on my default build (which builds most things) and
tried to make the removed default cast explicit.
With that approach it neither changes nor bloats the code, since it compiles
to whatever it was doing before, just with the casts made explicit... in a
few cases I changed some length args from int to size_t but largely left
the causes alone.
From now on, new code that is relying on less than ideal casting
will complain and nudge me to improve it by warnings.
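As an illustration of the kind of change involved (the function is hypothetical and not taken from the patch), an int length passed on to memcpy() converts silently to size_t by default, and warns with -Wconversion until the cast is written out:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy up to dst_len bytes from src, where the source length arrives
 * as an int. Returns the number of bytes copied. */
static size_t
ex_copy(char *dst, size_t dst_len, const char *src, int src_len)
{
	size_t n;

	assert(dst != NULL && src != NULL);

	/* extra guard, beyond the pure cast: a negative int would
	 * otherwise become a huge size_t */
	if (src_len < 0)
		return 0;

	/* explicit version of the conversion the compiler used to do
	 * silently; generates the same code as before */
	n = (size_t)src_len;
	if (n > dst_len)
		n = dst_len;

	memcpy(dst, src, n);

	return n;
}
```

The cast itself changes nothing in the generated code. The negative check is an addition for the example only, and shows the sort of latent issue the warning tends to draw attention to.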