How to Write a Python Script to Download Reddit Videos

Written by h3avren | Published 2021/12/19
TL;DR: Get the links for the video (without audio) and for the audio from the Reddit link, then use ffmpeg to merge the audio into the video so that you end up with a Reddit video that has sound.

Hello friends! Let’s do something new today (well, it’s not new, to be honest).

Where did it all start?

I was trying to download videos from Reddit and found that the available browser extensions were either paid or not working. Thankfully, some websites did the job just fine. I used those for quite a while, until I got bored of opening a website every time I wanted to download something. So I thought, why not write a Python script that takes the link and downloads the file?

Before we Start…

Reddit stores videos in a way that makes them harder to download (but we will download them anyway). It keeps the video without the audio at one URL and the audio at another, and the Reddit player loads and plays both of them simultaneously. So we will download both and stitch them together with ffmpeg.

The audio URL is the video URL with the quality value (the video height that follows DASH_) replaced by ‘audio’.
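As a rough illustration of that substitution, here is a minimal sketch. The helper name audio_url_from_video_url and the sample URL are made up for this example (the URL is a placeholder, not a real post), and the DASH_<quality>.mp4 naming is taken from the script later in this article.

```python
import re

def audio_url_from_video_url(video_url):
    """Swap the quality part after 'DASH_' for 'audio' (hypothetical helper)."""
    # '.../DASH_720.mp4' becomes '.../DASH_audio.mp4'
    return re.sub(r'DASH_[^/.]+(\.mp4)', r'DASH_audio\1', video_url)

# placeholder URL, used only to show the shape of the substitution
example = 'https://v.redd.it/abc123xyz/DASH_720.mp4'
print(audio_url_from_video_url(example))
```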

Let’s say the video URL is :

Let’s act now…

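Libraries and tools that we need

  • subprocess : for system calls to run ffmpeg (part of the standard library)
  • sys : to work with command-line arguments (part of the standard library)
  • bs4 (aka BeautifulSoup) : for web scraping (needs to be installed)
  • lxml : the parser that BeautifulSoup is told to use in the code below (needs to be installed)
  • requests : for making HTTP requests (needs to be installed)
  • json : to parse JSON data (part of the standard library)
  • ffmpeg : to work with media files; it is not a Python library and has to be installed on the machine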
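Since the script depends on an external ffmpeg binary, it can help to check for it up front. This is only a sketch: the helper name ffmpeg_available is made up, and it assumes ffmpeg is expected to be reachable through the PATH.

```python
import shutil

def ffmpeg_available(executable='ffmpeg'):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None

if not ffmpeg_available():
    print('ffmpeg was not found on the PATH; install it before running the script')
```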

Kicking off…

# imports

import subprocess
import json
from bs4 import BeautifulSoup
import requests
import sys

# getting a response using the URL

url = sys.argv[1]    # gets the url passed in the command-line
headers = {'User-Agent':'Mozilla/5.0'}
response = requests.get(url,headers = headers)

# finding the post id for the Reddit post

post_id = url[url.find('comments/') + 9:]
post_id = f"t3_{post_id[:post_id.find('/')]}"
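The slicing above relies on a slash following the id in the URL. As a hedged alternative, here is a small sketch that pulls the id out with a regular expression and returns None when the URL does not look like a post link. The function name post_id_from_url and the sample URL are made up for the example.

```python
import re

def post_id_from_url(url):
    """Return the 't3_'-prefixed post id from a Reddit post URL, or None."""
    match = re.search(r'/comments/([A-Za-z0-9]+)', url)
    return f't3_{match.group(1)}' if match else None

# placeholder URL, not a real post
print(post_id_from_url('https://www.reddit.com/r/example/comments/abc123/some_title/'))
```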

What does the response have..?

The response is the whole HTML file written for that particular Reddit page. But since Reddit is a dynamic website, most of the HTML we see is generated using JavaScript, and so are the media files. Therefore, to find the links to the media files, we’ll have to find the script tag that holds the data.

I googled it and found that a JSON file can be obtained simply by appending .json to the end of a Reddit link, and the video URLs can easily be grabbed from there. But I decided to dig into the original HTML code and find the script tag with the data. And I found it. It was in a script tag with the id attribute set to ‘data’. Let’s extract that using BeautifulSoup.

# processing the response to find the data

if response.status_code != 200:    # stop if the server did not respond with OK
    sys.exit(f'Could not fetch the page (status code {response.status_code})')

soup = BeautifulSoup(response.text,'lxml')
# I looked up the original code of the reddit page
# to find where all the data was and it was in a script tag
# with the id set to 'data'
required_js = soup.find('script',id='data')
if required_js is None:    # the page layout is not the one this script expects
    sys.exit("Could not find the script tag with the id 'data'")

# 'window.___r = ' and the semicolon at the end of the text are removed
# to get the data as json
json_data = json.loads(required_js.text.replace('window.___r = ','')[:-1])
post = json_data['posts']['models'][post_id]
title = post['title'].replace(' ','_').replace('/','_')    # keep the title usable as a file name
# the dash URL is the main URL we need to search for
# height is used to find the best quality of video available
dash_url = post['media']['dashUrl']
height = post['media']['height']
dash_url = dash_url[:dash_url.find('DASH') + 4]
video_url = f'{dash_url}_{height}.mp4'    # this URL will be used to download the video
audio_url = f'{dash_url}_audio.mp4'    # this URL will be used to download the audio part
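To see the prefix-and-semicolon stripping in isolation, here is a small sketch that runs on a made-up string instead of a real page. The function name parse_data_script is hypothetical, and the sample payload only mimics the keys used in this article; it is not real Reddit data.

```python
import json

def parse_data_script(script_text, prefix='window.___r = '):
    """Strip the JavaScript assignment around the JSON payload and parse it."""
    text = script_text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    text = text.rstrip(';')    # drop the trailing semicolon, if present
    return json.loads(text)

# tiny made-up payload, shaped like the data the script reads
sample = 'window.___r = {"posts": {"models": {"t3_abc123": {"title": "Hi there"}}}};'
data = parse_data_script(sample)
print(data['posts']['models']['t3_abc123']['title'])
```

Unlike the one-liner in the script, this version does not cut off the last character when the semicolon is missing.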

# downloading the video and audio files
# (each file is created only after its download has succeeded)

print('Downloading Video...',end='',flush=True)
response = requests.get(video_url,headers=headers)
if response.status_code == 200:
    with open(f'{title}_video.mp4','wb') as file:
        file.write(response.content)
    print('\rVideo Downloaded...!')
else:
    print('\rVideo Download Failed..!')
    sys.exit(1)    # nothing to merge without the video

print('Downloading Audio...',end='',flush=True)
response = requests.get(audio_url,headers=headers)
if response.status_code == 200:
    with open(f'{title}_audio.mp3','wb') as file:
        file.write(response.content)
    print('\rAudio Downloaded...!')
else:
    print('\rAudio Download Failed..!')
    sys.exit(1)    # ffmpeg cannot merge without the audio file
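The downloads above hold each whole file in memory before writing it. For larger videos, one option is to write the data chunk by chunk. The sketch below shows only the writing part, on plain byte chunks; the helper name save_chunks is made up, and the commented usage assumes the stream=True option and the iter_content method of requests.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as file:
        for chunk in chunks:
            if chunk:    # skip empty chunks
                file.write(chunk)
                written += len(chunk)
    return written

# Possible use with requests (not run here, it needs a real URL):
# response = requests.get(video_url, headers=headers, stream=True)
# if response.status_code == 200:
#     save_chunks(response.iter_content(chunk_size=8192), f'{title}_video.mp4')
```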

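# using ffmpeg to stitch the video and audio into one
# ('-c:v copy' copies the video stream as-is; the audio is re-encoded with ffmpeg's defaults)

subprocess.call(['ffmpeg','-i',f'{title}_video.mp4','-i',f'{title}_audio.mp3','-map','0:v','-map','1:a','-c:v','copy',f'{title}.mp4'])
# 'rm' is available on Linux and macOS; on Windows the two files have to be deleted another way
subprocess.call(['rm',f'{title}_video.mp4',f'{title}_audio.mp3'])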
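The ffmpeg call above ignores whether the merge actually worked. As a sketch of one way to handle that, the command can be built by a small function and its return code checked. The names build_merge_command and merge are made up for this example; the flags are the same ones used above, plus an optional '-c:a copy', which assumes the downloaded audio is in a codec that the MP4 container accepts without re-encoding.

```python
import subprocess

def build_merge_command(video_path, audio_path, output_path, copy_audio=False):
    """Build the ffmpeg argument list that merges one video and one audio file."""
    command = ['ffmpeg', '-i', video_path, '-i', audio_path,
               '-map', '0:v', '-map', '1:a', '-c:v', 'copy']
    if copy_audio:
        # assumption: the audio stream can be copied into MP4 as-is
        command += ['-c:a', 'copy']
    command.append(output_path)
    return command

def merge(video_path, audio_path, output_path):
    """Run ffmpeg and report success as a boolean instead of ignoring the result."""
    result = subprocess.run(build_merge_command(video_path, audio_path, output_path))
    return result.returncode == 0

print(build_merge_command('clip_video.mp4', 'clip_audio.mp3', 'clip.mp4'))
```

With something like this, the intermediate files could be deleted only when merge returns True, so a failed merge does not throw away the downloads.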

Finally!

We have our video downloaded successfully, and the intermediate video and audio files have been deleted. I didn’t explain many things in detail, but I hope this article gets you interested in learning web scraping. I’d also recommend learning the ffmpeg tool. Wish you a happy coding journey! 🙂🙂🙂


Written by h3avren | Dreaming of Python... Under a sky in India...
Published by HackerNoon on 2021/12/19