A new data structure is about to take dlhist's place. dlhist is currently
implemented as a mixture: it is meant to be a "process cache" that records
which RSS items have been processed (that is why the URL is used as the
unique identifier), but right now it only stores a URL if the item has
been downloaded. A new data structure, a real "download history",
shall be implemented that keeps track of each title and where
it has been downloaded to. This will make it possible to
download an RSS title to a given location only once.
Splitting this data structure into two separate structures is trivial,
as a "process cache" will treat URLs as the unique identifier and
a "download history" will treat the title of an RSS item as the
unique identifier (and also track its destinations).
This commit does not change any functionality; I just rename
the current structure to keep the "dlhist" prefix and source files
free for when the real dlhist is implemented.
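
The planned split could be sketched roughly as follows. This is only an
illustration: the struct names, the helper functions and the fixed
MAX_DESTS limit are assumptions, not the definitions used in the tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limit, chosen only to keep the sketch free of malloc(). */
#define MAX_DESTS 4

/* Process cache: one entry per processed RSS item, keyed by URL. */
struct pcache_entry {
    const char *url;
};

/* Download history: keyed by title, tracking every destination. */
struct dlhist_entry {
    const char *title;
    const char *dests[MAX_DESTS];
    size_t ndests;
};

/* Return 1 if this entry says `title` was already stored in `dest`. */
static int dlhist_has(const struct dlhist_entry *e,
                      const char *title, const char *dest)
{
    if (strcmp(e->title, title) != 0)
        return 0;
    for (size_t i = 0; i < e->ndests; i++)
        if (strcmp(e->dests[i], dest) == 0)
            return 1;
    return 0;
}

/* Record a destination. Returns 0 on success, -1 if it is already
 * recorded or the entry is full. */
static int dlhist_add_dest(struct dlhist_entry *e, const char *dest)
{
    for (size_t i = 0; i < e->ndests; i++)
        if (strcmp(e->dests[i], dest) == 0)
            return -1;
    if (e->ndests == MAX_DESTS)
        return -1;
    e->dests[e->ndests++] = dest;
    return 0;
}
```

A caller would check dlhist_has() before downloading and call
dlhist_add_dest() after a successful store.
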
Now that the table size can be calculated, let's store the number of
entries instead of the size in the header, so we can rely on that when
reading entries instead of the actual size on disk. This is safer if
data is appended to the file outside of the application.
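
A minimal sketch of reading by entry count, assuming a hypothetical
layout of one header followed by fixed-size entries. The struct and
function names are made up for illustration, and the buffer stands in
for the bytes read from the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk layout: header, then nentries entries. */
struct tbl_header { uint32_t nentries; };
struct tbl_entry  { uint32_t key; uint32_t value; };

/* Copy the entries announced by the header into `out`.
 * Returns the number of entries, or -1 if the buffer is too short
 * for the announced count or `out` cannot hold them.
 * Bytes after the last announced entry are ignored. */
static long tbl_parse(const unsigned char *buf, size_t len,
                      struct tbl_entry *out, size_t max)
{
    struct tbl_header h;

    if (len < sizeof h)
        return -1;
    memcpy(&h, buf, sizeof h);
    if (h.nentries > max)
        return -1;
    if ((len - sizeof h) / sizeof *out < h.nentries)
        return -1;
    memcpy(out, buf + sizeof h, (size_t)h.nentries * sizeof *out);
    return (long)h.nentries;
}
```

Because the count comes from the header, trailing bytes appended by
another program do not turn into bogus entries.
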
Somehow I apparently forgot to do linear probing in he_insert, which
results in colliding entries read from file (and when resizing)
being dropped on the floor.
Let's not drop things on the floor anymore; surely there is
another place in the table that will do fine, instead of just
giving up and throwing the entry on the floor.
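
For reference, linear probing on insert looks roughly like this. It is
a sketch with a made-up slot layout and a fixed table size, not the
actual he_insert code.

```c
#include <stddef.h>

/* Hypothetical table layout for the sketch. */
#define TABLE_SIZE 8

struct slot {
    int used;
    unsigned key;
    int value;
};

/* Insert with linear probing: start at the home slot and walk forward
 * (wrapping around) until a free slot or the same key is found.
 * Returns the slot index used, or -1 if the table is full. */
static int probe_insert(struct slot *tbl, unsigned key, int value)
{
    size_t start = key % TABLE_SIZE;

    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t idx = (start + i) % TABLE_SIZE;

        if (!tbl[idx].used || tbl[idx].key == key) {
            tbl[idx].used = 1;
            tbl[idx].key = key;
            tbl[idx].value = value;
            return (int)idx;
        }
    }
    return -1;
}
```

Only a completely full table makes the insert fail; a collision on the
home slot just moves the entry to the next free one.
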
info->msg is being assigned to 'error', but there is no such variable,
although there is a function with that name in error.h.
Fix this by assigning info->msg to 'err' instead; that is the variable
passed to pcre_compile().
Use SHA-1 hashes instead of C strings as keys to make records fixed size.
Because it is hard to find collisions in SHA-1 hashes, this works well
in practice, and dynamic memory allocation for the variable-size keys
is not needed anymore. Space is also reduced, since most key strings are
more than 20 bytes long.
Calculating SHA-1 should be fast enough to not add any more overhead
than dynamic memory allocation did.
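
A sketch of what a fixed-size record keyed by a 20-byte digest can look
like. The struct layout and names are assumptions, and computing the
digest itself is left to whatever SHA-1 implementation is in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 20   /* a SHA-1 digest is 20 bytes */

/* Hypothetical record: no pointers, so it has the same size for every
 * key and can be written to and read from disk as-is. */
struct record {
    unsigned char key[DIGEST_LEN];
    uint32_t value;
};

/* Linear search by digest; keys are compared with memcmp(), not
 * strcmp(), since a digest may contain zero bytes. */
static const struct record *record_find(const struct record *recs,
                                        size_t n,
                                        const unsigned char *key)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(recs[i].key, key, DIGEST_LEN) == 0)
            return &recs[i];
    return NULL;
}
```
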
When going through the filter list for an item, we download and store the
item every time a filter matches.
This patch allows an item to be downloaded the first time a filter
matches and keeps the data throughout the rest of the list, so all
other matches never download the item again but use the data in memory.
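
The idea can be sketched as a lazy fetch that runs at most once per
item. All names here are hypothetical, substring matching stands in
for the real filter matching, and a constant string stands in for the
actual download.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical item: `data` stays NULL until the first match. */
struct item {
    const char *title;
    const char *data;
    int fetch_count;    /* only here to make the behaviour visible */
};

/* Return the item data, downloading it on first use only. */
static const char *item_data(struct item *it)
{
    if (it->data == NULL) {
        it->data = "payload";   /* stand-in for the real download */
        it->fetch_count++;
    }
    return it->data;
}

/* Walk the filter list; every match reuses the data in memory.
 * Returns the number of filters that matched. */
static int process_item(struct item *it,
                        const char *const *filters, size_t nfilters)
{
    int matches = 0;

    for (size_t i = 0; i < nfilters; i++) {
        if (strstr(it->title, filters[i]) == NULL)
            continue;
        const char *data = item_data(it);
        (void)data;     /* would be stored at the filter's destination */
        matches++;
    }
    return matches;
}
```
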
Sometimes you want to fetch a file into memory so you can
store it in multiple places on disk without having to download it
again or copy files. While http_fetch_page works for fetching data
into memory, the possible filename found in the 'Content-Disposition'
header field is not taken into account.
http_fetch_file() fetches the data and stores it in memory while trying
to get hold of the filename.
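
Pulling the filename out of a 'Content-Disposition' value could look
something like the sketch below. cd_filename() is a hypothetical
helper, not the code in http_fetch_file(), and it is deliberately
simplified: it matches the parameter name case-sensitively and ignores
the extended "filename*=" form.

```c
#include <stddef.h>
#include <string.h>

/* Copy the filename parameter of a Content-Disposition value into
 * `out`. Handles both quoted and unquoted values.
 * Returns 0 on success, -1 if there is no filename or it does not
 * fit in `outlen` bytes (including the terminating NUL). */
static int cd_filename(const char *value, char *out, size_t outlen)
{
    const char *p = strstr(value, "filename=");
    size_t n;

    if (p == NULL)
        return -1;
    p += strlen("filename=");

    if (*p == '"') {
        const char *end;

        p++;
        end = strchr(p, '"');
        if (end == NULL)
            return -1;
        n = (size_t)(end - p);
    } else {
        /* Unquoted: runs until a separator or whitespace. */
        n = strcspn(p, "; \t\r\n");
    }

    if (n == 0 || n >= outlen)
        return -1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```
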
Better to have a destination for every filter than for every target.
Otherwise one will have to have two targets with the same source but
different destinations.
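
With the destination on the filter, a single target can feed several
locations. A rough sketch follows; the struct and function names are
assumptions, the URL is a placeholder, and substring matching stands
in for the real filter matching.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: the destination lives on the filter. */
struct filter {
    const char *pattern;
    const char *dest;
};

struct target {
    const char *source;
    const struct filter *filters;
    size_t nfilters;
};

/* Collect the destinations of all filters matching `title`.
 * Returns how many were written to `out` (at most `max`). */
static size_t target_dests(const struct target *t, const char *title,
                           const char **out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < t->nfilters && n < max; i++)
        if (strstr(title, t->filters[i].pattern) != NULL)
            out[n++] = t->filters[i].dest;
    return n;
}
```
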