In one of our last video posts we presented how to extract metadata from images and save it to CSV (you can change the output format to JSON just by setting the property `output.format` to `json`). We also showed integration with Digital Ocean Spaces. Now we want to present how you can ingest the results of Metadata Digger's work into Apache Solr. If you are not familiar with this powerful full-text search engine, please read our previous post with an introduction to Solr.
Metadata Digger uses Solr 8.2, but it should work with other versions as well (we didn't test them, but 6.x and 7.x should be fine, and 8.x certainly is). You can install Solr manually or use our Docker image prepared for development and testing purposes. Remember that if you want to index many documents, you will have to tune this image.
Starting Solr
In this post we will use a Docker image, so first please install Docker (and additionally Docker Compose): https://docs.docker.com/get-docker/. If you decide to install Solr manually, remember to add the schema, config and collection, which you can find in our repository. If you chose the Docker image, follow the steps from `dev/README.md` in the metadata-digger-deployment repo:
- Download the whole project using `git clone git@github.com:data-hunters/metadata-digger-deployment.git` or just download a zipped version: https://github.com/data-hunters/metadata-digger-deployment/archive/master.zip
- Go to the `dev` directory inside the project.
- Start Solr with this command (it will take a while the first time): `docker-compose up upload_config`
- Go to http://localhost:8983/solr
Now you should see something like this:
Adding collection
Our image has a schema and config uploaded to Solr, so the only thing you have to do is add the collection, which is pretty simple:
- Go to Collections and click "Add Collection".
- Fill in the form, setting the collection name to "metadata_digger" (you can choose a different name, but remember to update the MD properties) and the config set to "metadata_digger". We use an existing config set (containing the `managed-schema` and `solrconfig.xml` files) that we prepared and uploaded to ZooKeeper.
- Select the newly created collection in the dropdown list and click the Query link.
- Now you should see a form for sending search queries to Solr.
- Submit the form. You should see a JSON response on the right with information about 0 results.
Configuring Metadata Digger
At this point we need to extract and index metadata from images. If you haven't downloaded Metadata Digger yet, do it now:
- Go to https://github.com/data-hunters/metadata-digger/releases
- Download the standalone version (0.2.0) and unzip it.
- Go to the `metadata-digger-0.2.0_standalone` directory and open the file `configs/solr.config.properties` in your favorite text editor.
Please set the following properties:
- `input.paths` – path to the directory containing your images.
- `processing.maxMemoryGB` – set a realistic value; you shouldn't use a value greater than your actual memory.
- `output.collection` – leave unchanged, unless you used a different collection name during creation.
- `output.solr.conversion.dateTimeTags` – tells MD which fields should be adjusted to align with Solr's DateTime format. In the case of our schema it will be: `md_exif_ifd0_datetime,md_icc_profile_profile_datetime,md_gps_md_datetime,md_exif_subifd_datetime_original`
- `output.solr.conversion.integerTags` – similar, but for integer values. Metadata can contain anything, so MD will take care of cleansing and adjusting the values to avoid errors on the Solr side. Our schema requires the following fields to be defined here: `md_jpeg_image_width,md_jpeg_image_height,md_exif_subifd_exif_image_width,md_exif_subifd_exif_image_height,md_gps_gps_satellites`
- `processing.thumbnails.enabled` – if you leave it set (`true`), MD will generate thumbnails and persist them in Solr. Set it to `false` unless you really need it.
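To sum up, a `configs/solr.config.properties` edited for a local run might look like the sketch below (the input path is a placeholder you must change; everything else follows the values listed above):

```properties
# Directory with your images (placeholder path - change it)
input.paths=/data/my-images
# Do not set this higher than your actual free memory
processing.maxMemoryGB=2
# Matches the collection created in the Solr admin UI
output.collection=metadata_digger
output.solr.conversion.dateTimeTags=md_exif_ifd0_datetime,md_icc_profile_profile_datetime,md_gps_md_datetime,md_exif_subifd_datetime_original
output.solr.conversion.integerTags=md_jpeg_image_width,md_jpeg_image_height,md_exif_subifd_exif_image_width,md_exif_subifd_exif_image_height,md_gps_gps_satellites
# Disable thumbnails unless you need them
processing.thumbnails.enabled=false
```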
Additional configuration (optional)
You can skip this section if you don't want to go into details about tuning the Metadata Digger Solr Writer.
- `output.zk.servers` – ZooKeeper servers in the following format: `SERVER1:PORT1,SERVER2:PORT2,SERVER3:PORT3`
- `output.zk.znode` – the ZNode dedicated to Solr. ZooKeeper keeps data in a tree structure similar to a file system. If you share ZK between more than one service (in our situation it's only Solr), you should create a znode like `/solr` and point to it in the Solr configuration.
- `output.columns.metadataPrefix` – a prefix that will be added to all metadata fields extracted from files. It's good to have it if you use Solr as an output, because you can define dynamic fields for all metadata using a single definition like: `<dynamicField name="md_*" type="text_gen_sort" indexed="true" stored="true" />`.
- `output.columns.namingConvention` – MD supports two naming conventions: `snakeCase` and `camelCase`. If you choose `snakeCase`, the tag "Image width" will be converted to "image_width". For `camelCase`, you will get the following field: "ImageWidth".
- `output.columns.includeDirsInTags` – metadata in a file are organized into groups called directories, like Exif IFD0 or JPEG. During extraction MD can keep those names or not. If we set the naming convention to `snakeCase`, the prefix to `md_` and this property to `true`, you will get the following field for the "Image width" tag (JPEG directory): `md_jpeg_image_width`.
- `processing.thumbnails.enabled` – MD can generate thumbnails (medium and/or small) from images and store them in Solr. You should use a binary type in the definition of those Solr fields.
- `processing.thumbnails.mediumDimensions` – size of the medium thumbnail in the format `<IMAGE_WIDTH>x<IMAGE_HEIGHT>`. MD will put the generated image into the `medium_thumb` field. Leave it empty or remove the whole property if you don't want it.
- `processing.thumbnails.smallDimensions` – the same as above but for the small thumbnail. Output will be stored in the `small_thumb` field.
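For orientation, the schema definitions behind those properties could look roughly like this (a sketch only; the `md_*` dynamic field is quoted from above, while the thumbnail field definitions are our assumption of the shape — check the real `managed-schema` in the metadata-digger-deployment repo):

```xml
<!-- Catch-all dynamic field for extracted metadata (md_ prefix) -->
<dynamicField name="md_*" type="text_gen_sort" indexed="true" stored="true" />
<!-- Thumbnail fields should use a binary type -->
<field name="medium_thumb" type="binary" indexed="false" stored="true" />
<field name="small_thumb" type="binary" indexed="false" stored="true" />
```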
See more information in our GitHub repo:
- https://github.com/data-hunters/metadata-digger#apache-solr
- https://github.com/data-hunters/metadata-digger#generating-thumbnails
Launching MD!
Solr is up and the configuration is ready; time for extraction and indexing. Run the following command: `sh run-standalone-metadata-digger.sh extract configs/solr.config.properties`
You should have similar results to the following:
As you can see there are two warnings:
2020-05-15 23:33:09 WARN Extractors$Transformations$:98 - Error occurred during metadata extraction for image: file:/data/metadata-extractor-images/README.rst (Message: File format could not be determined). Ignoring file...
2020-05-15 23:33:09 WARN Extractors$Transformations$:98 - Error occurred during metadata extraction for image: file:/data/metadata-extractor-images/README (Message: File format could not be determined). Ignoring file...
There were two text files in my directory, and that was the reason for the above warnings. It won't stop processing, but for performance reasons it's better not to have file types other than images in the input directory.
You can also encounter another message: "Cannot parse value of tag: X". It means that for some reason MD couldn't determine the value of this tag. In most cases it is a custom format, specific to some device. If you notice such a situation, you can create a ticket on GitHub (and attach a sample image if possible). We will check it.
Time for search!
Let's check how many documents (representing files) have been indexed. Go to http://localhost:8983/solr and select the `metadata_digger` collection (left menu, under the "Suggestions" link), then click "Query" and you will see the main page for searching your documents. The default query is `*:*`, which means "everything" to Solr. Use it without changing anything and click the "Execute query" button.
Results will be displayed on the right side of the page in JSON format. Now you can check how many documents are stored in your collection; look at the `numFound` attribute. For me it's 414, as you can see in the image below:
Now you can try different keywords. Solr will search through all metadata extracted from your images. That's because we prepared a special field (not visible in the results) where we put values from all meta tags. The Solr config for this collection is prepared to search on this field by default (you can change it, of course).
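In Solr, such a catch-all field is typically built with `copyField` rules in the schema. A hypothetical sketch (the field name `all_meta` is made up for illustration and is not necessarily what our config set uses):

```xml
<!-- Hypothetical catch-all: copy every metadata field into one searchable field -->
<field name="all_meta" type="text_general" indexed="true" stored="false" multiValued="true" />
<copyField source="md_*" dest="all_meta" />
```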
As you probably noticed, you don't get all results on a single page. That's because Solr has a default limit of 10. You can navigate through results using two parameters:
- `start` – index of the first document (from the result set) that should be displayed
- `rows` – number of documents that should be displayed
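The same parameters can also be passed directly to Solr's `/select` HTTP endpoint instead of through the admin UI. A minimal Python sketch (host and collection name follow the setup above):

```python
from urllib.parse import urlencode

def solr_select_url(query, start=0, rows=10,
                    base="http://localhost:8983/solr/metadata_digger/select"):
    """Build a Solr /select URL with pagination parameters."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return f"{base}?{params}"

# Second page of results, 10 documents per page
url = solr_select_url("*:*", start=10, rows=10)
print(url)
```

Opening the resulting URL in a browser (or with `curl`) returns the same JSON response you see in the admin UI.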
Faceting – basic stats
To get an overview of your data, we can ask Solr for basic stats. Assuming you want to check all devices used to take the photos, we can use the `md_exif_ifd0_model` field and the Faceting feature.
To do that, follow the steps below:
- Click the "facet" checkbox.
- Paste the `md_exif_ifd0_model` field name.
- Execute the query.
- Scroll down to the bottom of the results. You should see the `facet_counts` attribute/object.
It will be something similar to this:
As you can see, the mentioned field (`md_exif_ifd0_model`) is a list of key-value pairs. It's a bit of a weird format, because instead of a list of JSON objects (like `{ "value": "COOLPIX P1", "count": 3 }`) we get a single flat list (ordered descending by count): value1, count1, value2, count2, etc. The main reason for such a format is probably the need to optimize parsing. The most important thing is that we can learn two things from this list:
- Names of all devices used to take the images
- Number of images/documents with the same device model
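If you consume facet results programmatically, the flat list is easy to turn back into pairs. A minimal Python sketch (the sample values are made up for illustration):

```python
def parse_flat_facets(flat):
    """Convert Solr's flat facet list [v1, c1, v2, c2, ...] into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

# Shape of facet_counts.facet_fields.md_exif_ifd0_model (sample values)
flat = ["COOLPIX P1", 3, "Canon EOS 400D DIGITAL", 2, "iPhone 6", 1]
pairs = parse_flat_facets(flat)
print(pairs)  # [('COOLPIX P1', 3), ('Canon EOS 400D DIGITAL', 2), ('iPhone 6', 1)]
```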
Facet results are strictly bound to the query parameters. Let's suppose you want to search for "canon" to narrow the results. Type the keyword in the `q` field, run the query and go to `facet_counts`. In my case it looks like this:
Now the mentioned list looks a bit different, as we have only Canon models at the top. Other models will also be included, but with a 0 value, since there are no images in the results for the "canon" keyword containing, for instance, the "iPhone 6" value in the `md_exif_ifd0_model` field.
Remember the default limits on the number of results returned by Solr. They also apply to faceting, not only to the main result list, but here the default value is higher: 100. You can change it using the `facet.limit` parameter.
More search
You can now play with Solr and check out various features of this powerful platform. You can find more information about different search options on the official site: https://lucene.apache.org/solr/guide/8_2/searching.html. When you finish, you can close the Docker container with `ctrl+c`. Solr saves data in the `sc_data` directory, so next time you can just run Solr, without indexing all images again, using the command `docker-compose up solrcloud`.
In the following posts, we will show more advanced aspects of searching.
Follow our page if you want to keep up with new posts 🙂