TikTok has attracted a lot of attention from OSINT investigators in recent years. In this article, I will show how to implement a Python application that downloads TikTok videos and converts speech to text using an OpenAI model.
TikTok as a data source
There are a number of reasons why social media are a valuable source of data for investigators. Finding information about individuals is one of them. Since the Russian invasion of Ukraine, TikTok has played a significant and visible role in information warfare.
It’s much easier to gather and analyze data from text-based services like Twitter to detect disinformation than from social media channels that rely mostly on video as their main content. Before we start any large-scale offline search or analysis, we have to download the videos and process them to retrieve useful information.
A lot can be done with videos, e.g.:
- Extracting sound and converting speech to text. In further steps, the results could be indexed in a full-text search engine.
- Extracting text from video frames.
- Detecting objects (buildings, people, cars, etc.) on particular frames.
- Detecting movements of objects between frames.
- Detecting whether a video and/or its sound is a deepfake.
I have written a simple Python application that gathers videos and then converts speech to text using an OpenAI model. Let’s see how it works!
TikTok Analyzer
The application is available on GitHub: https://github.com/data-hunters/tiktok-analyzer. In this section, I will show how to use it. If you are a programmer, go to the next sections, where we will go through the most important parts of the code.
Let’s start with the most important library: the TikTok scraper. I used an existing library, TikTokPy. The main problem with scraping TikTok is that the official API is very limited. It is a never-ending race between TikTok’s changing interface and open-source contributors, who constantly update the code and look for workarounds for the anti-bot detection. If you search for TikTok scrapers on GitHub or Google, you will find plenty of apps and scripts, but some (or most) of them won’t work. At the time of writing this article, TikTokPy was working properly, so I hope it also works now ;).
One thing I’ve noticed is the occurrence of 403 errors (while downloading a video) from time to time. It could be related to TikTok anti-bot mechanisms.
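One way to cope with these occasional 403 responses is to retry the download after a short pause. The snippet below is only a sketch of that idea, not code from TikTok Analyzer: the function name download_with_retry, the number of attempts and the delays are my own assumptions, and there is no guarantee that waiting is enough to get past the anti-bot mechanisms.

```python
import time
import urllib.error
import urllib.request


def download_with_retry(url, output_file, attempts=3, base_delay=2.0,
                        fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Download url to output_file, retrying when the server answers 403.

    The retry count and delays are illustrative assumptions, not values
    taken from TikTok Analyzer. Returns True on success, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, output_file)
            return True
        except urllib.error.HTTPError as err:
            # Only 403 is treated as transient here; other errors propagate.
            if err.code != 403:
                raise
            if attempt < attempts:
                # Exponential backoff: base_delay, 2 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False
```

The fetch and sleep parameters exist only so the function can be exercised without touching the network; in normal use you would call download_with_retry(url, output_file) and check the returned flag.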
Setup
First, clone the repository:
git clone https://github.com/data-hunters/tiktok-analyzer.git
or download the ZIP archive.
Before you run the application, install the required libraries:
pip install tiktokapipy
python -m playwright install
pip install whisper-openai
Downloading videos by hashtags
To download the 10 latest videos (and soundtracks) for the hashtag ukraine to the tiktok_videos directory, run this:
python run.py --hashtag ukraine --output-path tiktok_videos --max-videos 10
Each file has the following name format: “<username>_<video_id>.<format>”, where the format is mp3 or mp4.
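To make the naming scheme concrete, here is a small sketch that builds such a file name and splits it back into its parts. Both helper names (build_output_path and parse_file_name) are hypothetical and not part of TikTok Analyzer, and parsing assumes that the video ID itself contains no underscore.

```python
from pathlib import Path


def build_output_path(output_dir, username, video_id, fmt):
    """Return <output_dir>/<username>_<video_id>.<fmt> for mp3 or mp4."""
    if fmt not in ("mp3", "mp4"):
        raise ValueError(f"unsupported format: {fmt}")
    return Path(output_dir) / f"{username}_{video_id}.{fmt}"


def parse_file_name(path):
    """Split a file name back into (username, video_id, fmt)."""
    p = Path(path)
    # Usernames may contain underscores, so split on the last one only.
    # Assumption: the video ID itself has no underscore in it.
    username, video_id = p.stem.rsplit("_", 1)
    return username, video_id, p.suffix.lstrip(".")
```

Splitting on the last underscore keeps usernames such as some_user intact, which matters when you later want to group transcriptions by author.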
Downloading users’ videos
To download the 10 latest videos (with the soundtrack) of the user test123 to the tiktok_videos directory, run the following:
python run.py --user test123 --output-path tiktok_videos --max-videos 10
Speech to text
Converting the sound from a video to text is a separate process that can be run as its own command. You need to pass an input directory, and TikTok Analyzer will process all mp3 files within this directory. There is also one additional parameter that you can specify: the name of the OpenAI model. The default value is base. If it’s not sufficient (in terms of the quality of the output transcription), you can try a bigger model. The full list is available here: https://github.com/openai/whisper#available-models-and-languages.
The following command runs Analyzer on files from the tiktok_videos directory, saves the results to the tiktok_transcription directory and uses the medium model.
python run.py --transcribe --input-path tiktok_videos --output-path tiktok_transcription --model medium
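Conceptually, this step is a loop over the mp3 files in the input directory. The sketch below shows one way such a loop could look; it is not the project’s actual code. The function name transcribe_directory and the choice to write one .txt file per recording are assumptions, and the transcribe argument stands for any callable that returns a dictionary with a "text" key, such as the transcribe method of a loaded whisper model.

```python
from pathlib import Path


def transcribe_directory(input_dir, output_dir, transcribe):
    """Run transcribe(path) on every mp3 in input_dir and save the text.

    Writing one .txt file per recording is an assumption made for this
    sketch, not necessarily the layout TikTok Analyzer produces.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorted so that repeated runs process files in the same order.
    for mp3 in sorted(Path(input_dir).glob("*.mp3")):
        result = transcribe(str(mp3))
        target = out / f"{mp3.stem}.txt"
        target.write_text(result["text"].strip(), encoding="utf-8")
        written.append(target)
    return written
```

With whisper installed, you would load a model once with whisper.load_model("medium") and pass its transcribe method as the third argument.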
Putting all the pieces together
You can also utilize all the features of Analyzer with a single command. It will go through the steps in the following order:
- Scraping videos by hashtag.
- Scraping users’ videos.
- Converting speech to text.
If we want to combine all the commands used, it will look as follows:
python run.py --hashtag ukraine --user test123 --max-videos 10 --transcribe --input-path tiktok_videos --output-path tiktok_videos --model medium
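For readers who want to see how such a command line could be declared, here is a minimal argparse sketch. It is reconstructed only from the flags used in this article and is not the actual parser from run.py; the help texts and the planned_steps helper are my own additions, and the only default taken from the article is base for the model.

```python
import argparse


def build_parser():
    """Parser mirroring the flags used in the commands in this article."""
    p = argparse.ArgumentParser(prog="run.py")
    p.add_argument("--hashtag", help="download videos for this hashtag")
    p.add_argument("--user", help="download videos of this user")
    p.add_argument("--max-videos", type=int, default=None,
                   help="maximum number of videos to download")
    p.add_argument("--output-path", help="directory for downloaded or transcribed files")
    p.add_argument("--input-path", help="directory with mp3 files to transcribe")
    p.add_argument("--transcribe", action="store_true",
                   help="convert speech to text")
    p.add_argument("--model", default="base", help="name of the whisper model")
    return p


def planned_steps(args):
    """List the requested steps in the order described in the article."""
    steps = []
    if args.hashtag:
        steps.append("hashtag")
    if args.user:
        steps.append("user")
    if args.transcribe:
        steps.append("transcribe")
    return steps
```

Parsing the combined command shown above with this sketch would yield all three steps, in the order hashtag, user, transcribe.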
Implementation
Let’s go through the most important parts of the code.
Scraping
All the code responsible for downloading videos (by user and hashtag) is located in the ttanalyzer/scrapper.py file. As I mentioned at the beginning, I used the TikTokPy library for scraping. It doesn’t use the official TikTok API, so no credentials are needed. You can just import the TikTokAPI class:
from tiktokapipy.api import TikTokAPI
Then build an api object and call the challenge endpoint to retrieve the most popular videos for a hashtag. As you can see below, you can limit the number of downloaded videos. If you omit the limit, the scraper won’t set one. That doesn’t mean it will download all videos with a particular hashtag: TikTok has anti-bot mechanisms and will probably block you if you don’t use the scraper in a reasonable way.
with TikTokAPI() as api:
    videos_wrapper = api.challenge(hashtag, video_limit=video_limit)
The returned object (videos_wrapper) contains a videos field, which is a list of objects providing information about particular videos. I encourage you to run a debugger and see what is inside each object. TikTok Analyzer uses the following fields:
- author – username
- id – video’s ID
- music.play_url – URL to the raw mp3 soundtrack
- video.download_addr – URL to the raw video
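The fields listed above can be collected into a plain dictionary before downloading anything. The sketch below does this for a stand-in object with made-up values, so it runs without TikTokPy; the helper name media_urls is hypothetical. It assumes the attribute layout described above, which may differ between TikTokPy versions (for example, author might be an object rather than a plain username).

```python
from types import SimpleNamespace


def media_urls(video):
    """Collect the four fields listed above from one video object."""
    return {
        "username": video.author,
        "video_id": video.id,
        "sound_url": video.music.play_url,
        "video_url": video.video.download_addr,
    }


# Stand-in object with made-up values, shaped like the fields listed above.
sample = SimpleNamespace(
    author="test123",
    id=1,
    music=SimpleNamespace(play_url="https://example.com/sound.mp3"),
    video=SimpleNamespace(download_addr="https://example.com/video.mp4"),
)
print(media_urls(sample))
```

Keeping this mapping in one place means that only this function needs to change if the library renames a field.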
Crawling videos by user is very similar. The only difference is the method called on the api object:
with TikTokAPI() as api:
    videos_obj = api.user(user, video_limit=video_limit)
Once we have a list of objects, we use the urllib.request module from the Python standard library to grab the music/sound and the video:
import urllib.request as req

req.urlretrieve(url, output_file)
Here, url is video.music.play_url for the sound and video.video.download_addr for the video content.
Converting speech to text
In September 2022, OpenAI released the whisper library, which provides a multilingual model for speech-to-text analysis. It supports many languages; however, it works best in English. You can choose between 5 multilingual models, from the tiny one, which is relatively small and fast but not very accurate, to the biggest one, which is slower but gives better results. Read more in the official repository.
Running transcription from Python is very easy:
import whisper


class VoiceAnalyzer:

    def __init__(self, model_name):
        self.model = whisper.load_model(model_name)

    def transcribe(self, path):
        r = self.model.transcribe(path)
        print(f'Text: {r["text"]}')
        return r


transcription = VoiceAnalyzer("base").transcribe("path/to/file.mp3")
As you can see, the result object has a text field, which contains the recognized text. In addition, there is more detailed information where the text is divided into segments.
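The segments are stored under the "segments" key of the result, and each segment carries its start and end time in seconds together with its text. As a small sketch, the hypothetical helper below turns them into timestamped lines; it is shown on a hand-written dictionary with made-up values instead of a real transcription, so it runs without whisper.

```python
def format_segments(result):
    """Turn a whisper result into '[start - end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {text}")
    return lines


# Hand-written stand-in for a transcription result (made-up values).
fake_result = {
    "text": " Hello world.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello"},
        {"start": 1.5, "end": 3.0, "text": " world."},
    ],
}
for line in format_segments(fake_result):
    print(line)
```

Timestamps like these make it easier to jump to the relevant part of a video when a transcription contains something interesting.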
When you use whisper in your own application, remember to load the model once at the beginning, as loading takes some time.
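If your application works with more than one model name, a small cache keeps each model loaded only once. This is a sketch under my own assumptions rather than part of TikTok Analyzer: the ModelCache class is hypothetical, and the loader parameter exists so the cache can be tried out without whisper installed. When no loader is given, it falls back to whisper.load_model.

```python
class ModelCache:
    """Load each model at most once and reuse it for later calls."""

    def __init__(self, loader=None):
        # When loader is None, whisper.load_model is imported lazily on
        # first use, so creating the cache does not require whisper.
        self._loader = loader
        self._models = {}

    def get(self, model_name):
        if model_name not in self._models:
            loader = self._loader
            if loader is None:
                import whisper
                loader = whisper.load_model
            self._models[model_name] = loader(model_name)
        return self._models[model_name]
```

Calling get("base") twice on the same cache returns the same model object, so the slow loading step runs only on the first call.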
Wrapping up
I’ve shown a simple example of how you can crawl TikTok videos and carry out a useful analysis using Artificial Intelligence. From a programmer’s perspective, it was very easy thanks to two open-source libraries. Remember that you need to be careful with all scraping libraries for services like TikTok, as there is no guarantee that they will work or that you will get all the expected results.
What’s next? The sky is the limit but here are my first ideas:
- Making TikTok Analyzer more scalable. For now, I’m using the synchronous API, but TikTokPy also provides an async one. It would be nice to have parallel computation for crawlers and speech-to-text processing.
- Persisting text and video statistics to a full-text search engine like Elasticsearch or Solr. You can use a web UI like Kibana on top of Elasticsearch to build interactive dashboards and gain better insights from bigger data volumes.
- Running OCR on video frames.
- Building a relation between users using hashtags or other parts of the content.
If you are interested in contributing to this project or in commercial cooperation in this area, contact me at jan@datahunters.ai.
Stay tuned!