In the last post we introduced Metadata Digger, a tool for extracting and analyzing metadata from huge amounts of images. Because it is built on Apache Spark, it can run on multiple machines, which makes the whole processing faster and, most importantly, feasible at all. Imagine you have hundreds of terabytes or even petabytes of images. It can be quite chaotic: you have been crawling that data from multiple sources, and of course you have some information about the origin, but is it really the origin? Let’s think about it for a while and ask some other questions:
- Maybe someone has copied an image from some Facebook profile and pasted it on a forum or somewhere else? How could this be detected? There are tons of images.
- Most services (especially social media) currently remove original metadata from images, but there are some websites/forums/applications that still keep it. How can you find images containing useful metadata without additional manual work and infinite time?
- Maybe you have an image with interesting metadata and want to find similar ones in your large dataset?
- Maybe you just want to put all that data into some processing black box and search the whole dataset by keywords?
- What about the actual content of images? It would be nice to enrich metadata with information about what is in the picture, right?
Metadata Digger can help you solve such problems. How? Let’s try to answer:
- Even if a service removed the original metadata, it left new metadata coming from its image processing library. Sometimes it also leaves additional information that can help identify the service. You can extract metadata from your dataset with MD and automatically index it to Apache Solr – a full text search engine. The next step is to use filtering queries to select those images. You can also make the whole process simpler by just providing a list of mandatory meta tags; MD will ignore all files which don’t contain those tags. Results can be saved to CSV, JSON or indexed to Solr.
- This problem is similar, so you can extract metadata and index it to Solr. Once you have it in a full text search engine, it is easy to first get some statistics about distinct values of particular meta tags (use Solr’s faceting feature) and then build a query that narrows the results down to interesting data (a sample query is sketched right after this list).
- This problem can be easily solved with MD. We implemented a special feature that, for a given image, finds images with similar metadata (in tags selected by you). Results can be exported to CSV, JSON or indexed to Solr for further analysis. A good example is finding images created with the same device model, by the same author, or with a similar geolocation.
- We have mentioned that MD allows you to index results into Solr – a popular, scalable full text search engine. It is a stable, Open Source technology used by many big companies. You can control which data is indexed by setting mandatory tags or by passing a list of output meta directories (Exif, GPS, etc.). If you just want to search and don’t need an intuitive search app with visualizations, you can use the Solr Admin UI (shipped with Solr). You can also build your own application on top of Solr or integrate with existing solutions like Apache Hue dashboards or Zeppelin (we will write about it in the following posts).
- Ok, but what about the actual content? In the latest version MD was integrated with Intel’s Deep Learning frameworks – Analytics Zoo and BigDL, which are built on top of Spark. Thanks to this integration you can just provide a path to your Deep Learning model, specify a labels mapping and run MD on your data set. As a result you will get an additional field, labels, containing a list of categories recognized by the model. We have trained a sample model on the popular COCO data set containing 80 general categories. It’s not perfect but gives quite good results.
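To give a rough idea of the Solr side of this workflow, here is a minimal sketch using Solr’s standard faceting and filter-query parameters. The collection name (metadata) and the field name (Exif_Software) are hypothetical; the real names depend on how you configured MD and your Solr schema:

# Get counts of distinct values of a (hypothetical) Exif_Software field,
# to see which processing libraries/services show up in the indexed metadata:
curl "http://localhost:8983/solr/metadata/select?q=*:*&rows=0&facet=true&facet.field=Exif_Software"
# Then narrow the results down to images matching an interesting value:
curl "http://localhost:8983/solr/metadata/select?q=*:*&fq=Exif_Software:*Picasa*&rows=10"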
We are not going to go through all of the above problems in detail here; let’s start with some basics instead.
Processing really big data makes no sense on a common laptop. You need a cluster to do it. What is a cluster? It’s a set of servers managed by clustering software which lets you run computations on all machines (or some of them) at the same time, with control over resources like CPU and RAM. In the case of Spark, which is the core part of Metadata Digger, you have the following options for the clustering system:
- Spark Standalone Cluster (not to be confused with MD Standalone mode)
- YARN – Hadoop clustering system
- Mesos
- Kubernetes (experimental)
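Just for orientation, Spark applications (including the MD Distributed version mentioned below) are typically submitted to one of these cluster managers with spark-submit, and the --master URL is what selects the manager. The jar name, main class and resource settings below are placeholders, not the real MD ones; check the MD documentation for the exact command:

# Hypothetical submission of a Spark application to YARN; jar and class names are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MetadataDiggerMain \
  --executor-memory 4G \
  metadata-digger-distributed.jar extract config.properties
# The --master value picks the cluster manager:
#   Spark Standalone: --master spark://<host>:7077
#   Mesos:            --master mesos://<host>:5050
#   Kubernetes:       --master k8s://https://<host>:6443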
Let’s forget about those complex things for a while. If you want to start and test on a limited data set, we have prepared a special version of MD that can be run on your machine without a cluster. You just need Linux, Java 8 (JRE or JDK) and a reasonable amount of RAM. MD scales well across multiple cores even on a single machine, so for testing purposes you can run it locally. By default MD leaves 1 core for the OS and uses all the others (you can control this with configuration properties).
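Before running it, you can quickly confirm that Java 8 is available on the machine:

# Should report version 1.8.x (Java 8), from either a JRE or a JDK
java -version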
If you are experienced with Big Data, have a cluster, or just want to run MD on some cloud-based service providing Spark, you should use the MD Distributed version.
Let’s start with a basic extraction process using the Standalone version (it can be run anywhere on Linux, including your laptop):
- Go to https://github.com/data-hunters/metadata-digger. This is the main MD page with quite detailed documentation, which will be useful for adjusting the whole process.
- Click on the Releases tab and download metadata-digger-0.2.0_standalone.zip
- Unpack it:
unzip metadata-digger-0.2.0_standalone.zip
- Go to the standalone directory and you will see the configs directory with sample configuration files. Open csv.config.properties and adjust the properties: set input.paths to the path to the directory with your images (it can be a comma-separated list) and output.path to the path where MD will create a directory with the results. There is also one other property (filter.mandatoryTags); comment it out (with # at the beginning of the line) or remove it for now (an example of an adjusted file is sketched after these steps).
- As you can see, there are other properties in the file:
  - input.storage.name and output.storage.name – for testing you can use file, which means the local file system. For a bigger data set you should use some distributed file system like HDFS or S3 (https://github.com/data-hunters/metadata-digger/tree/v0.2.0#reader-configuration)
  - output.format – possible values: csv, json or solr (use solr.config.properties if you want to try Solr because additional properties are needed)
  - processing.maxMemoryGB – the amount of memory in gigabytes that will be assigned to MD. Spark likes memory; it’s not a lightweight technology.
- Run the following command:
sh run-standalone-metadata-digger.sh extract <PATH_TO_CONFIG>
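For reference, here is a rough sketch of what an adjusted csv.config.properties could look like, using only the properties discussed above. The paths and the memory value are hypothetical; the authoritative list of properties is the sample file itself and the MD documentation:

# Local file system in, local file system out (hypothetical paths)
input.storage.name=file
input.paths=/home/user/images
output.storage.name=file
output.path=/home/user/md-output
output.format=csv
processing.maxMemoryGB=2
# filter.mandatoryTags=...   (left commented out for now)

With such a file in place, the command from the last step becomes, for example:

sh run-standalone-metadata-digger.sh extract configs/csv.config.properties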
Once it is running, MD (and Spark underneath) will print its log output to the console. When the job finishes, go to the output directory and check the results. If you used the sample csv.config.properties, you should have one file produced by MD in CSV format. Basically, if you don’t set the output.filesNumber property, the number of output files depends on the size of the data set. During load (and processing) Spark creates packages of data to process them in parallel. Each package is called a partition and produces one output (in our case a final CSV/JSON file). You can control this number with the mentioned property (output.filesNumber), but be careful here, especially if you run on a huge data set in distributed mode, because Spark has to move the results from all partitions to one place in memory and then save them to a file. It can cause Out of Memory issues.
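For example, to get a single CSV file for a small test data set, you could add the property to your config (keeping in mind the caveat above about large data sets in distributed mode):

output.filesNumber=1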
Okay, that’s all for today. I hope it was easy for you to get started with Metadata Digger. In the following posts we will write about using different file systems (HDFS and S3), filtering MD’s output, using Deep Learning models to detect what is in an image, and more.
I hope all the information was clear and helpful in your research. Stay alert and await our next post 🙂