In the last post I showed how to extract metadata from images, write it to Solr (using Metadata Digger) and run some basic searches. This time I want to explain more advanced querying options. In the first part we’ll go through the following topics:
- I’ll briefly explain what a Query Parser is;
- We’ll narrow results to documents containing GPS data and learn about Boolean queries;
- We’ll also use the GPS Satellites tag and Image Width/Height in our search;
- In the final step we’ll use Spatial search to select images taken in some area (in our example – within 50 km of the center of Berlin).
In the next part of this tutorial, we’ll play with DateTime fields, find a way of searching when we don’t know or remember the exact device model, and learn how to display only some fields. I’ll also explain the main difference between the q and fq parameters.
Query Parser
Every search engine needs to translate user input to understand which information should be retrieved and returned. I don’t want to explain all the Solr internals, but it’s important to know that the core component responsible for this is the Query Parser. Why? Because Solr provides three main built-in parsers, and each of them offers a slightly different syntax and search capabilities. Changing the parser is very easy: just set its name in the defType query parameter. These are the three main parsers:
- Standard Query Parser – the default parser; sufficient in many cases, but it is very intolerant of syntax errors, so using it can sometimes be frustrating.
- DisMax Query Parser (use defType=dismax) – provides syntax similar to Google’s and handles errors in a more user-friendly way. However, this cuts both ways: sometimes it’s better to get a clear error than not to know the exact reason for specific search results.
- Extended DisMax Query Parser (use defType=edismax) – an improved and extended DisMax.
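As a rough illustration of switching parsers, the sketch below only builds request URLs and sends nothing. It assumes a local Solr with a collection called metadata_digger (taken from the Query form URL used in this post) and the standard /select endpoint; the helper name build_select_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: local Solr with a "metadata_digger" collection, as in the
# Query form URL used in this post.
SOLR_SELECT_URL = "http://localhost:8983/solr/metadata_digger/select"


def build_select_url(query, parser=None):
    """Return a /select URL, optionally choosing the parser via defType."""
    params = {"q": query}
    if parser is not None:
        # Without defType, Solr falls back to the Standard Query Parser.
        params["defType"] = parser
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_select_url("tag_names:gps"))
print(build_select_url("tag_names:gps", parser="edismax"))
```

The only difference between the two printed URLs is the extra defType=edismax parameter.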
The parsers above provide the main interface between a user and Solr, but there are also other parsers responsible for specific tasks, like Graph or Spatial. The full list is available here: https://lucene.apache.org/solr/guide/8_2/other-parsers.html#other-parsers.
Solr is built with customizability in mind, so if you want to build your own parser, you can do it by extending the QParserPlugin class in Java. However, it’s not a trivial task and won’t be explained today.
Geotagging
Currently, a lot of software doesn’t include geotags by default for privacy reasons. Even if they’re included, they may be removed by some service (e.g. Facebook). However, it’s still possible to find such images on the Internet. You just have to search in the right places 😉 In the end you will probably have tons of images and only some of them will have geotags. It would be like looking for a needle in a haystack. MD can help you filter out the images that are not interesting (in this case). You have at least two options here:
- Ignore images without geotags on MD extraction level by using mandatory tags: https://github.com/data-hunters/metadata-digger#setting-up-mandatory-tags.
- Index all images to Solr with MD and then use filtering queries.
Let’s try with the second option.
Finding images with GPS information (filtering/boolean)
If you are a good searcher, or were just lucky enough to find images with GPS meta tags, you need to pick out the interesting ones with Solr queries. You can do this in different ways. Let’s start with the easiest one:
- Go to the Query form: http://localhost:8983/solr/#/metadata_digger/query.
- Type tag_names:gps in the q field.
- Run the query.
Solr returned all documents that have the term “gps” in the tag_names field. This field contains a list of all metatags (no values, just names). Sometimes you just want to filter out files which don’t contain a specific tag. In that case you can use the above query. Let’s go a little bit deeper here. The default Solr parser provides Boolean operators, so you can build complex field queries. Suppose you want to find documents meeting all of the following criteria:
- We want to have a directory (main meta group like Exif IFD0) containing phrase “gps”.
- There needs to be a tag with exact name “Image width”
- There has to be a tag with a name containing “gps latitude” (but it should also match “GPS Latitude Ref”) or “gps longitude”
- It cannot have a tag with name “Thumbnail pixels”
We can build the following query:
directory_names:gps AND tag_names:(("image width") AND ((gps latitude) OR (gps longitude)) AND -("thumbnail pixels"))
As you can see, we specified two field queries joined with the AND operator. The default operator is OR, so if you remove AND, Solr will interpret the lack of an operator as OR. The more interesting part is the right-hand side, the query on the tag_names field:
- We used an exact match (“”) for the “image width” tag, as we want to tell Solr not to look for different combinations of those words.
- There are brackets for building a subquery related to GPS latitude and longitude. We can simplify it: since a space is interpreted as OR, you can just type (gps latitude longitude) instead of ((gps latitude) OR (gps longitude)), but I wanted to show a very simple example of subqueries.
- The last part is negation. We want to exclude from the results all documents with the tag “Thumbnail pixels”. Remember that it will also remove all documents with a tag name that merely contains the mentioned phrase.
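If you prefer to assemble such queries in code rather than by hand, here is a minimal Python sketch. The helpers phrase, all_of, any_of and search are hypothetical names invented for this example, and the collection name metadata_digger is assumed from the Query form URL. The resulting query drops the redundant brackets around the quoted phrases but otherwise mirrors the one above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: local Solr with a "metadata_digger" collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/metadata_digger/select"


def phrase(text):
    """Wrap text in double quotes so Solr treats it as an exact phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def all_of(*clauses):
    """Join clauses with AND inside one pair of brackets."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """Join clauses with OR inside one pair of brackets."""
    return "(" + " OR ".join(clauses) + ")"


tag_query = all_of(
    phrase("image width"),
    any_of("(gps latitude)", "(gps longitude)"),
    "-" + phrase("thumbnail pixels"),  # leading minus negates the clause
)
query = "directory_names:gps AND tag_names:" + tag_query
print(query)


def search(q, rows=10):
    """Send the query to Solr and return the matching documents."""
    url = SOLR_SELECT_URL + "?" + urlencode({"q": q, "rows": rows, "wt": "json"})
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]
```

Calling search(query) needs a running Solr instance, so the sketch only prints the query string.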
If you are familiar with “Google Dorks”, “X-Ray”, etc., the syntax above is certainly nothing new. Let’s go further and filter by numeric values.
Find images taken with poor GPS satellites coverage (numbers, ranges)
If a picture was taken with GPS metatags, there is a chance (unfortunately small, but still) that the device also provided a GPS Satellites meta tag. It contains the number of satellites used to determine the geolocation. MD indexes it in the md_gps_gps_satellites field. Let’s suppose we want to find images taken with fewer than 4 satellites. You can just add md_gps_gps_satellites:[1 TO 3] to your query. It will narrow the results to documents with md_gps_gps_satellites in the range between 1 and 3 (inclusive). Remember to add AND/OR if it’s part of a bigger query.
If you don’t want to specify an upper or lower bound (and just get all values on that side), you can use *, so you can change the above example to [* TO 3].
When you use ranges, like in the example above, you can:
- include the left and right values – use [ and ],
- exclude them – use { and },
- mix both types of brackets.
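The bracket rules can be captured in a small helper. This is only a sketch: range_clause is a made-up name, and None stands for an open bound (*).

```python
def range_clause(field, low=None, high=None, include_low=True, include_high=True):
    """Build a Solr range clause; None means an open bound (*)."""
    left = "[" if include_low else "{"    # [ includes the lower bound, { excludes it
    right = "]" if include_high else "}"  # ] includes the upper bound, } excludes it
    low_text = "*" if low is None else str(low)
    high_text = "*" if high is None else str(high)
    return f"{field}:{left}{low_text} TO {high_text}{right}"


# Inclusive on both sides: 1, 2 or 3 satellites.
print(range_clause("md_gps_gps_satellites", 1, 3))
# Mixed brackets: no lower bound, upper bound 4 excluded.
print(range_clause("md_gps_gps_satellites", high=4, include_high=False))
```

Note that the second clause, unlike [1 TO 3], would also match a value of 0.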
Filtering by image dimensions
MD provides four other numerical fields. These are related to the width and height of an image:
- Width and height from the JPEG directory: md_jpeg_image_width, md_jpeg_image_height
- The same but from the Exif SubIFD directory: md_exif_subifd_exif_image_width, md_exif_subifd_exif_image_height
You can use those fields to filter results in the same way as we did for satellites. If you want to have results for images with the following conditions: width >= 2000 and height >= 3000, you can build a query like this:
md_exif_subifd_exif_image_width:[2000 TO *] AND md_exif_subifd_exif_image_height:[3000 TO *]
Remember that these are just metadata added by a human or software. They may or may not match the actual image dimensions.
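Since the same dimensions may be stored in either directory, one possible approach (my assumption, not something MD requires) is to accept a match from either pair of fields. The helper min_size_query below is hypothetical and just formats the two range clauses.

```python
def min_size_query(width_field, height_field, min_width, min_height):
    """Require width and height to be at least the given values."""
    return (
        f"{width_field}:[{min_width} TO *] AND "
        f"{height_field}:[{min_height} TO *]"
    )


exif_query = min_size_query(
    "md_exif_subifd_exif_image_width", "md_exif_subifd_exif_image_height", 2000, 3000
)
jpeg_query = min_size_query(
    "md_jpeg_image_width", "md_jpeg_image_height", 2000, 3000
)

# Assumption: a document qualifies if either directory reports a large enough image.
either_query = f"({exif_query}) OR ({jpeg_query})"
print(either_query)
```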
Spatial search
There are two parsers for spatial search: geofilt and bbox. Both take a center point (location) and a distance. geofilt finds documents whose geolocation lies within that circle, while bbox uses the bounding box of the circle, which is less precise but cheaper to compute.
Let’s assume you want to find all photos taken within 50 km of the center of Berlin. First you need to get the latitude and longitude of the city. Many tools and services provide them, such as Google Maps, OpenStreetMap, etc. A rough value for the center of Berlin is 52.507,13.285. Now we can prepare a query. The final query should look like this:
{!geofilt sfield=md_gps_md_location pt=52.507,13.285 d=50}
Please put it into the fq parameter and run it. In the next part of this tutorial I’ll explain why.
As you can see, it’s pretty easy.
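To get a feeling for what “within 50 km of a point” means, here is a small great-circle distance check using the haversine formula. It is only an illustration of the idea, not Solr’s actual implementation, and the coordinates for Potsdam and Hamburg are approximate values I picked for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


BERLIN = (52.507, 13.285)  # the point used in the geofilt query
POTSDAM = (52.39, 13.06)   # approximate, for illustration only
HAMBURG = (53.55, 9.99)    # approximate, for illustration only

for name, point in [("Potsdam", POTSDAM), ("Hamburg", HAMBURG)]:
    distance = haversine_km(*BERLIN, *point)
    print(name, round(distance, 1), "km, within 50 km:", distance <= 50)
```

A photo geotagged in Potsdam would pass a 50 km filter around this point, while one from Hamburg would not.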
At the beginning of this post I mentioned that Solr provides three main parsers but also some task-specific ones. Geofilt is one of the latter. When you want to use it, you have to adhere to the following format:
{!<PARSER_NAME> <PARAM1>=<VAL1> <PARAM2>=<VAL2>}
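The format above is easy to generate programmatically. The sketch below uses a hypothetical helper, local_params, and does not handle values containing spaces or quotes (those would need quoting). It builds the geofilt filter and places it in fq next to a match-all q.

```python
from urllib.parse import urlencode


def local_params(parser_name, **params):
    """Format {!<PARSER_NAME> <PARAM1>=<VAL1> ...} from keyword arguments."""
    parts = [f"{key}={value}" for key, value in params.items()]
    return "{!" + " ".join([parser_name] + parts) + "}"


geo_filter = local_params(
    "geofilt", sfield="md_gps_md_location", pt="52.507,13.285", d=50
)
print(geo_filter)

# q=*:* matches every document; the spatial restriction goes into fq.
query_string = urlencode({"q": "*:*", "fq": geo_filter})
print(query_string)
```

Swapping "geofilt" for "bbox" in the call would produce the bounding-box variant with the same parameters.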
You can find more information about both parsers here.
Next steps
This was the first part of my tutorial about more advanced search options in Solr. We’ll continue this topic in the next post. Follow me on Twitter for the latest updates – @jca3s!