Not a huge performance loss anyway, and the directory may be deleted
between calls. However, the directory can still be missing as soon
as the mkdir() call has returned.
Doing it this way just means that if the directory is removed during
execution of the program, it will be created again instead of leaving
the program without it for the rest of its lifetime.
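A minimal sketch of the idea, assuming a helper that is called before
every use of the directory. The name ensure_dir() and the 0755 mode are
hypothetical and not taken from the actual code:

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: make sure `path` exists as a directory.
 * Returns 0 on success, -1 on failure (errno describes the error). */
int ensure_dir(const char *path)
{
    struct stat st;

    if (mkdir(path, 0755) == 0)
        return 0;               /* directory was (re)created */
    if (errno != EEXIST)
        return -1;              /* real failure, e.g. EACCES or ENOENT */
    /* Something already exists at `path`; accept it only if it is a directory. */
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }
    return 0;
}
```

Calling this every time costs one system call, and it does not close
the race: the directory can still vanish right after the call returns.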
using the pointer 'b->block' when it is possible that
reallocation has moved the memory to another location.
'b->block' may therefore be an invalid pointer in some
cases. Use 'ret' instead.
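The pattern can be sketched as follows. Only 'b->block' and 'ret' come
from the text above; the struct layout and the function buf_append()
are assumptions made for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type; only the 'block' member is named in the text. */
struct buf {
    char  *block;   /* start of the allocation */
    size_t len;     /* bytes in use */
    size_t cap;     /* bytes allocated */
};

/* Append n bytes, growing the allocation when needed.
 * Returns 0 on success, -1 if realloc() fails (the buffer is left intact). */
int buf_append(struct buf *b, const void *data, size_t n)
{
    if (b->len + n > b->cap) {
        size_t newcap = b->cap ? b->cap : 16;
        char *ret;

        while (newcap < b->len + n)
            newcap *= 2;
        ret = realloc(b->block, newcap);
        if (ret == NULL)
            return -1;
        /* realloc() may have moved the data: from here on only 'ret' is valid. */
        b->block = ret;
        b->cap = newcap;
    }
    memcpy(b->block + b->len, data, n);
    b->len += n;
    return 0;
}
```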
Use the refactored code from hash.c. Also
use chaining as the collision strategy instead of
open addressing, not only because the new hash API makes open
addressing hard to do, but also because chaining is more space efficient.
A collision with open addressing results in two occupied slots
in the hash table, but with chaining we only have one.
The worst-case complexity for search/insert/delete is still O(n) for both techniques.
Chaining is better because items that collide only take up one slot in the
hash table. Considering that the best case for space overflow is 25%, it
is better to have a small table.
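A rough sketch of chaining, not the actual hash.c API. All names here
(struct node, table_put(), table_find(), NBUCKETS) and the djb2 hash
function are assumptions chosen to keep the example short:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8   /* kept small on purpose; arbitrary value for the sketch */

struct node {
    char        *key;
    int          value;
    struct node *next;   /* next entry that hashed to the same slot */
};

struct table {
    struct node *slot[NBUCKETS];
};

/* djb2 string hash, chosen here only because it is short. */
unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

struct node *table_find(struct table *t, const char *key)
{
    struct node *n = t->slot[hash_str(key) % NBUCKETS];
    while (n && strcmp(n->key, key) != 0)
        n = n->next;
    return n;
}

/* Insert or update. Returns 0 on success, -1 on allocation failure. */
int table_put(struct table *t, const char *key, int value)
{
    struct node *n = table_find(t, key);
    size_t i, len;

    if (n) {
        n->value = value;
        return 0;
    }
    n = malloc(sizeof(*n));
    if (!n)
        return -1;
    len = strlen(key) + 1;
    n->key = malloc(len);
    if (!n->key) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->value = value;
    i = hash_str(key) % NBUCKETS;
    n->next = t->slot[i];   /* a collision extends the chain, not the table */
    t->slot[i] = n;
    return 0;
}
```

The table can hold more entries than it has slots, which is what allows
it to stay small.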
flush() is redundant; it makes more sense to just write the file on close().
There is no reason to commit the current state of the cache to disk
at any other time than when closing the application.
with SHA1 as a CRC mechanism.
When writing file formats that use SHA1 as a CRC, it is handy to
have SHA1_Update() applied to every write(), so that a
SHA1 hash can be calculated for that data and used as a CRC check.
Therefore this interface is created to wrap the code used to do this.
A new data structure is about to take dlhist's place. dlhist is currently
implemented as a mixture of a "process cache" that should record which
RSS items have been processed (that is why the URL is used as a unique
identifier), but right now it only stores a URL if it has been
downloaded. A new data structure, the "download history",
shall be implemented to keep track of each title and where
it has been downloaded to. This will make it possible to
download an RSS title to a given location only once.
Splitting this data structure into two separate structures is trivial,
as a "process cache" will treat URLs as a unique identifier and
a "download history" will treat the title in an RSS item as a
unique identifier (and also track its destinations).
This commit does not change any functionality; it only renames
things to keep the "dlhist" prefix and source files clear for
when the real dlhist is implemented.
Now that the table size can be calculated, let's store the number of entries
instead of the size in the header, so we can rely on that when reading
entries instead of the actual size on disk. This is safer if data is
appended to the file outside of the application.
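A sketch of what reading by entry count could look like. The header and
record layouts and the function names are hypothetical; the real file
format is not shown here:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk layout: a header followed by fixed-size records. */
struct header {
    uint32_t nentries;   /* number of records that follow */
};

struct record {
    uint32_t key;
    uint32_t value;
};

/* Write the header and n records. Returns 0 on success, -1 on error. */
int table_write(FILE *fp, const struct record *recs, uint32_t n)
{
    struct header h;

    h.nentries = n;
    if (fwrite(&h, sizeof(h), 1, fp) != 1)
        return -1;
    if (n > 0 && fwrite(recs, sizeof(*recs), n, fp) != n)
        return -1;
    return 0;
}

/* Read the records; the count comes from the header, so any bytes
 * appended after the last record are never looked at.
 * Returns the number of records read, or -1 on error. */
long table_read(FILE *fp, struct record *recs, uint32_t max)
{
    struct header h;

    if (fread(&h, sizeof(h), 1, fp) != 1)
        return -1;
    if (h.nentries > max)
        return -1;              /* caller's array is too small */
    if (h.nentries > 0 &&
        fread(recs, sizeof(*recs), h.nentries, fp) != h.nentries)
        return -1;              /* file is shorter than the header claims */
    return (long)h.nentries;
}
```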
Somehow I apparently forgot to do linear probing in he_insert, which
caused colliding entries read from file (and when resizing)
to be dropped on the floor.
Let's not drop things on the floor anymore. Certainly there is
another place in the table that will do fine, instead of just
giving up and throwing the entry on the floor.
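Linear probing on insert can be sketched like this. It is not the real
he_insert; struct entry, probe_insert(), probe_find() and TABLE_SIZE
are names made up for the example:

```c
#include <stddef.h>

#define TABLE_SIZE 8   /* arbitrary size for the sketch */

struct entry {
    int      used;
    unsigned key;
    int      value;
};

/* Insert with linear probing: on a collision, try the next slot (wrapping
 * around) instead of giving up. Returns 0 on success, -1 if the table is full. */
int probe_insert(struct entry *tab, unsigned key, int value)
{
    size_t start = key % TABLE_SIZE;
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++) {
        struct entry *e = &tab[(start + i) % TABLE_SIZE];
        if (!e->used || e->key == key) {
            e->used = 1;
            e->key = key;
            e->value = value;
            return 0;
        }
    }
    return -1;
}

/* Look a key up by following the same probe sequence as the insert. */
struct entry *probe_find(struct entry *tab, unsigned key)
{
    size_t start = key % TABLE_SIZE;
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++) {
        struct entry *e = &tab[(start + i) % TABLE_SIZE];
        if (!e->used)
            return NULL;        /* an empty slot ends the probe sequence */
        if (e->key == key)
            return e;
    }
    return NULL;
}
```

With TABLE_SIZE 8, the keys 1, 9 and 17 all hash to slot 1 and end up
in slots 1, 2 and 3 rather than two of them being lost.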
info->msg is being assigned to 'error', but there is no such variable,
although there is such a function in error.h.
Fix this by assigning info->msg to 'err' instead, which is the variable
passed to pcre_compile().
Use SHA1 hashes instead of C strings to make records fixed size.
Because it's hard to find collisions in SHA1 hashes, this works well
in practice. Dynamic memory allocation for the variable-size keys
is not needed anymore. Space is also reduced, since most key strings are
more than 20 bytes long.
Calculating SHA1 should be fast enough not to add more overhead
than dynamic memory allocation did.
When going through the filter list for an item, we download and store the item
every time a filter is matched.
This patch allows an item to be downloaded the first time a filter
matches and keeps the data throughout the rest of the list, so all
other matches never download the item again but use the data in memory.
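The control flow could look roughly like this. Everything in the sketch
is hypothetical: the struct, the function names, the substring match
standing in for a real filter, and the stub standing in for the HTTP
download:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical item type; 'data' stays NULL until the item is fetched. */
struct item {
    const char *title;
    const char *data;
};

int fetch_calls = 0;   /* counts how often the stub "downloads" */

/* Stand-in for the real download. */
const char *fetch_stub(const struct item *it)
{
    (void)it;
    fetch_calls++;
    return "payload";
}

/* Walk the filter list; fetch on the first match only and reuse the
 * in-memory data for every later match. Returns the number of matches. */
int process_item(struct item *it, const char *const *filters, size_t nfilters)
{
    int matches = 0;
    size_t i;

    for (i = 0; i < nfilters; i++) {
        if (strstr(it->title, filters[i]) == NULL)
            continue;               /* this filter does not match */
        if (it->data == NULL)
            it->data = fetch_stub(it);
        /* ...store it->data at this filter's destination... */
        matches++;
    }
    return matches;
}
```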
Sometimes you want to fetch a file into memory so you can
store it in multiple places on disk without having to download it
again or copy files. While http_fetch_page works for fetching data
into memory, the possible filename found in the 'Content-Disposition'
header field is not accounted for.
http_fetch_file() fetches the data and stores it in memory while trying to
get hold of the filename.