Scraping and transcribing TikTok videos with Python

TikTok has been attracting a lot of attention from OSINT investigators in recent years. In this article, I will show how to implement a Python application that downloads TikTok videos and converts speech to text using OpenAI’s Whisper model.

TikTok as a data source

There are a number of reasons why social media are a valuable source of data for investigators. Finding information about individuals is one of them. Since the Russian invasion of Ukraine, TikTok has also played a significant and visible role in information warfare.

To detect disinformation, it’s much easier to gather and analyze data from text-based services like Twitter than from social media channels that rely mostly on videos as their main form of content. Before we start any large-scale offline search or analysis, we have to download the videos and process them to retrieve useful information.

A lot can be done with videos, e.g.:

  • Extracting sound and converting speech to text. In further steps, the results could be indexed in a full-text search engine.
  • Extracting text from video frames.
  • Detecting objects (buildings, people, cars, etc.) on particular frames.
  • Detecting movements of objects between frames.
  • Detecting whether the video and/or sound is a deepfake.

I have written a simple Python application that gathers videos and then converts speech to text using OpenAI’s Whisper model. Let’s see how it works!

TikTok Analyzer

The application is available on GitHub: https://github.com/data-hunters/tiktok-analyzer. In this section, I will show how to use it. If you are a programmer, you can skip to the Implementation section, where we will go through the most important parts of the code.

Let’s start with the most important library, the TikTok scraper. I used an existing one, TikTokPy. The main problem with scraping TikTok is that the official API is very limited. It is a never-ending race between TikTok, which keeps changing its interface, and open-source contributors, who constantly update their code and look for workarounds for the anti-bot detection. If you search for TikTok scrapers on GitHub or Google, you will find plenty of apps and scripts, but some (or most) of them won’t work. At the time of writing this article, TikTokPy was working properly, so I hope it still works now ;).

One thing I’ve noticed is the occurrence of 403 errors (while downloading a video) from time to time. It could be related to TikTok anti-bot mechanisms.

Setup

First, clone the repository:

git clone https://github.com/data-hunters/tiktok-analyzer.git

or download it as a ZIP archive.

Before you run the application, install the required libraries:

pip install tiktokapipy
python -m playwright install
pip install openai-whisper

Downloading videos by hashtags

To download the 10 latest videos (and their soundtracks) for the hashtag ukraine to the tiktok_videos directory, run this:

python run.py --hashtag ukraine --output-path tiktok_videos --max-videos 10

Each file has the following name format: “<username>_<video_id>.<format>” where the format is mp3 or mp4.
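
The naming scheme above can be sketched as a pair of small helpers. This is an illustration only, not code taken from the repository: the function names build_filename and parse_filename are hypothetical, and the parser assumes that video IDs never contain an underscore.

```python
from pathlib import Path
from typing import Tuple


def build_filename(username: str, video_id: str, fmt: str) -> str:
    """Return '<username>_<video_id>.<format>' for an mp3 or mp4 file."""
    if fmt not in ("mp3", "mp4"):
        raise ValueError(f"unsupported format: {fmt}")
    return f"{username}_{video_id}.{fmt}"


def parse_filename(name: str) -> Tuple[str, str, str]:
    """Split a file name back into (username, video_id, format).

    Usernames may contain underscores, so we split on the last one.
    Assumption: the video ID itself never contains an underscore.
    """
    path = Path(name)
    username, _, video_id = path.stem.rpartition("_")
    return username, video_id, path.suffix.lstrip(".")
```

Splitting on the last underscore keeps usernames such as some_user intact when the downloaded files are later matched back to their authors.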

Downloading users’ videos

To download the 10 latest videos (with their soundtracks) of the user test123 to the tiktok_videos directory, run the following:

python run.py --user test123 --output-path tiktok_videos --max-videos 10

Speech to text

Converting sound from the videos to text is a separate step that can be run as its own command. You need to pass an input directory, and TikTok Analyzer will process all mp3 files within it. There is also one additional parameter that you can specify: the name of the Whisper model. The default value is base. If the quality of the output transcription is not sufficient, you can try a larger model. The full list is available here: https://github.com/openai/whisper#available-models-and-languages.

The following command runs Analyzer on the files from the tiktok_videos directory, saves the results to the tiktok_transcription directory and uses the medium model.

python run.py --transcribe --input-path tiktok_videos --output-path tiktok_transcription --model medium
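
Conceptually, this step is a loop over the mp3 files in the input directory. Below is a minimal sketch of such a loop, not the repository's actual code: the function name transcribe_directory and the one-.txt-file-per-mp3 output layout are assumptions. The transcribe argument can be any callable that takes a file path and returns a Whisper-style dictionary with a "text" key.

```python
from pathlib import Path
from typing import Callable, List


def transcribe_directory(
    input_dir: str,
    output_dir: str,
    transcribe: Callable[[str], dict],
) -> List[Path]:
    """Transcribe every mp3 in input_dir and write one .txt file per input."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Only mp3 files are processed; mp4 files in the same directory are skipped.
    for mp3 in sorted(Path(input_dir).glob("*.mp3")):
        result = transcribe(str(mp3))
        target = out / f"{mp3.stem}.txt"  # hypothetical output naming
        target.write_text(result["text"].strip(), encoding="utf-8")
        written.append(target)
    return written
```

With Whisper installed, you could pass whisper.load_model("medium").transcribe as the transcribe argument, so the model is loaded once and reused for every file.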

Putting all the pieces together

You can also utilize all the features of Analyzer with a single command. It will go through the steps in the following order:

  1. Scraping videos by hashtag.
  2. Scraping users’ videos.
  3. Converting speech to text.

Combining all the commands used so far looks as follows:

python run.py --hashtag ukraine --user test123 --max-videos 10 --transcribe --input-path tiktok_videos --output-path tiktok_videos --model medium
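
For readers curious how such a command line could be wired up, here is a minimal argparse sketch. It only mirrors the flags shown in this article, with base as the default model as mentioned earlier. The helper names build_parser and planned_steps are hypothetical, and the real run.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags used in this article."""
    parser = argparse.ArgumentParser(description="Download and transcribe TikTok videos")
    parser.add_argument("--hashtag", help="hashtag to scrape")
    parser.add_argument("--user", help="username whose videos should be scraped")
    parser.add_argument("--max-videos", type=int, default=None, help="limit of videos per source")
    parser.add_argument("--transcribe", action="store_true", help="run speech to text")
    parser.add_argument("--input-path", help="directory with mp3 files to transcribe")
    parser.add_argument("--output-path", help="directory for downloaded or transcribed files")
    parser.add_argument("--model", default="base", help="Whisper model name")
    return parser


def planned_steps(args: argparse.Namespace) -> list:
    """Return the steps to run, in the order described in the article."""
    steps = []
    if args.hashtag:
        steps.append("hashtag")
    if args.user:
        steps.append("user")
    if args.transcribe:
        steps.append("transcribe")
    return steps
```

Keeping the step order in one small function makes it easy to see that transcription always runs after both scraping steps have finished.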

Implementation

Let’s go through the most important parts of the code.

Scraping

All the code responsible for downloading videos (by user and by hashtag) is located in the ttanalyzer/scrapper.py file. As I mentioned at the beginning, I used the TikTokPy library for scraping. It doesn’t use the official TikTok API, so no credentials are needed. You can just import the TikTokAPI class:

from tiktokapipy.api import TikTokAPI

Then build an api object and call the challenge endpoint to retrieve the most popular videos for a hashtag. As you can see below, you can limit the number of downloaded videos with video_limit. If you omit it, the scraper won’t set any limit, but that doesn’t mean it will download every video with a particular hashtag: TikTok has anti-bot mechanisms and will probably block you if you don’t use the scraper in a reasonable way.

with TikTokAPI() as api:
    videos_wrapper = api.challenge(hashtag, video_limit=video_limit)

The returned object (videos_wrapper) contains a videos field, which is a list of objects describing particular videos. I encourage you to run a debugger and see what is inside each object. TikTok Analyzer uses the following fields:

  • author – username
  • id – video’s ID
  • music.play_url – URL to the raw mp3 sound track
  • video.download_addr – URL to the raw video
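
The fields listed above are enough to work out what has to be downloaded and under which name. The sketch below is illustrative rather than the repository's code: download_targets is a hypothetical helper, and it assumes that author is available as a plain username string, as described in the list. A stand-in object replaces a live API call here.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def download_targets(videos: Iterable) -> List[Tuple[str, str]]:
    """Return (url, file_name) pairs for the sound and video of each item."""
    targets = []
    for video in videos:
        # Assumption: video.author is the username as a plain string.
        base = f"{video.author}_{video.id}"
        targets.append((video.music.play_url, f"{base}.mp3"))
        targets.append((video.video.download_addr, f"{base}.mp4"))
    return targets


# Stand-in with the same attribute layout, so the sketch runs offline.
example = SimpleNamespace(
    author="test123",
    id=1,
    music=SimpleNamespace(play_url="https://example.com/sound.mp3"),
    video=SimpleNamespace(download_addr="https://example.com/video.mp4"),
)
pairs = download_targets([example])
```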

Crawling videos by user is very similar. The only difference is the method called on the api object:

with TikTokAPI() as api:
    videos_obj = api.user(user, video_limit=video_limit)

Once we have a list of objects, we use the urllib.request module from the standard library to grab the music/sound and the video:

import urllib.request as req
req.urlretrieve(url, output_file)

Here, url is video.music.play_url for the sound and video.video.download_addr for the video content.
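
Since downloads occasionally fail with a 403 error, as mentioned earlier, one possible mitigation is to retry after a pause. This is a sketch under assumptions, not something the repository is known to do: download_with_retry, the number of attempts and the delay are all hypothetical, and retrying may not help if TikTok has blocked the client.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def download_with_retry(
    url: str,
    output_file: str,
    attempts: int = 3,
    delay: float = 2.0,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Download url to output_file, retrying when the server answers 403.

    Returns True on success and False when every attempt ended with a 403.
    Any other HTTP error is re-raised immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, output_file)
            return True
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise  # only 403 is treated as retryable in this sketch
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer each time
    return False
```

The fetch and sleep parameters exist only to make the function easy to try out without network access; by default it calls urlretrieve exactly as in the snippet above.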

Converting speech to text

In September 2022, OpenAI released Whisper, a library that provides multilingual models for speech-to-text. It supports many languages, but it works best in English. You can choose between 5 multilingual models, from the tiny one, which is relatively small and fast but not very accurate, to the largest one, which is slower but gives better results. Read more in the official repository.

Running transcription from Python is very easy:

import whisper

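class VoiceAnalyzer:

    def __init__(self, model_name):
        # Loading a model is slow, so do it once per analyzer instance.
        self.model = whisper.load_model(model_name)

    def transcribe(self, path):
        # The result is a dict; the recognized text is stored under "text".
        r = self.model.transcribe(path)
        print(f'Text: {r["text"]}')
        return r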

transcription = VoiceAnalyzer("base").transcribe("path/to/file.mp3")

As you can see, the result object has a text field which contains the recognized text. In addition, there is more detailed information in which the text is divided into segments.
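
The segments can be turned into timestamped lines, which is handy when you want to jump to a particular moment of a video. The sketch below relies on the "segments" list in Whisper's result, where each segment carries "start", "end" and "text" keys. The helper name format_segments is hypothetical, and a hand-written dictionary stands in for a real transcription result.

```python
def format_segments(result: dict) -> list:
    """Return one '[start - end] text' line per segment of a Whisper result."""
    lines = []
    for seg in result.get("segments", []):
        # start and end are offsets in seconds from the beginning of the audio
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {seg['text'].strip()}")
    return lines


# Hand-written stand-in for the dictionary returned by model.transcribe().
fake_result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello world."},
        {"start": 1.5, "end": 3.25, "text": " This is a test."},
    ],
}
lines = format_segments(fake_result)
```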

When you use Whisper in your own application, remember to load the model once at startup, as loading takes some time.

Wrapping up

I’ve shown a simple example of how you can crawl TikTok videos and carry out a useful analysis using Artificial Intelligence. From a programmer’s perspective, it was very easy thanks to two open-source libraries. Remember to be careful with all scraping libraries for services like TikTok, as there is no guarantee that they will work or that you will get all the expected results.

What’s next? The sky is the limit but here are my first ideas:

  • Making TikTok Analyzer more scalable. For now, I’m using the synchronous API, but TikTokPy also provides an async one. It would be nice to have parallel computation for the crawlers and the speech-to-text processing.
  • Persisting text and video statistics to a full-text search engine like Elasticsearch or Solr. You can use a web UI like Kibana on top of ES to build interactive dashboards and gain better insights from bigger data volumes.
  • Running OCR on video frames.
  • Building a relation between users using hashtags or other parts of the content.

If you are interested in contributing to this project or in commercial cooperation in this area, contact me at jan@datahunters.ai.

Stay tuned!
