New file hash bites extractor

I'm pushing this out a bit early, but the covid stuff is so depressing that I could use a pick-me-up, so I'm giving myself the prize of releasing the first publicly available bites extension. It's early because I'm having some problems getting the tests to run. I keep getting:

[main] INFO com.complexible.common.LinuxDistribution - Determined Linux distribution from file '/etc/os-release': CentOS 7.0 (7)
[main] INFO com.stardog.starrocks.NativeStorageKernel - --> file:/home/centos/git/kibbles-string/target/test-classes/
[main] INFO com.stardog.starrocks.NativeStorageKernel - --> file:/home/centos/git/kibbles-string/target/classes/
[main] ERROR com.complexible.stardog.BaseStardogModule - System information is invalid and the repair failed: Unexpected native ordinal: 130818

I have no idea what the problem is, but I'll figure it out later. The extractors themselves seem to work, though. They compute the following hashes for your bites files: Md5, Sha1, Sha256, Sha384, and Sha512.
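For anyone who wants to sanity-check a hash value outside Stardog, here is a minimal sketch in Python using the standard hashlib module. It is not the extractor's code (that lives in the jar); the function name file_digests and the chunk size are just choices I made for illustration.

```python
import hashlib

# The five digest algorithms listed above, by their hashlib names.
ALGORITHMS = ("md5", "sha1", "sha256", "sha384", "sha512")


def file_digests(path, chunk_size=65536):
    """Return a dict mapping each algorithm name to the file's hex digest.

    The file is read once, in chunks, and every hasher is fed the same bytes,
    so large files do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling file_digests("bites-test.txt") gives you all five values for the same file used in the usage line, which you can compare against whatever the extractor stored.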

I figured these might be handy to use with the reasoner for exact file de-duplication, since I don't think bites storage is content-based. (I get why that is: content-based storage would need some sort of reference counting.)
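To show the de-duplication idea in isolation, here is a rough sketch in Python that groups files on disk by their SHA-256 digest. It runs outside Stardog and does not touch the reasoner or the extracted RDF; the names sha256_of and find_duplicates are made up for this example.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(paths):
    """Group paths by digest and return only the groups with more than one file.

    Files with the same digest are treated as exact duplicates.
    """
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[sha256_of(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Inside Stardog the same grouping would be done over the hash values the extractor stores for each document, rather than by re-reading the files.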

Installation: drop the jar into the $STARDOG_EXT directory and restart Stardog
Usage: stardog doc put --rdf-extractors Sha1 test bites-test.txt

I award Full Hero Points here... but can you throw something into the README.md so I know what it is?

Cheers,

Kendall

"Extractor hash" isn't enough? :)

This should be a little better: https://github.com/semantalytics-stardog/kibbles-bites-hash/blob/master/README.md

I promise I'll clean it up eventually. The embarrassment of having it publicly available will motivate me to actually do it.

I've got lots of other ideas to try too. I'd like to extract image perceptual hashes (pHash), although that really needs a custom index, which I'd like to do too. Maybe some more detailed image metadata extraction as well, although that is somewhat more complicated than it might seem. I was looking at the metadata-extractor library (https://github.com/drewnoakes/metadata-extractor), which provides output in a strange semi-structured format that seems like it would be perfect for RDF.