Get The Most Out Of Everything You Read Using Python

Written by ankush-garg | Published 2020/07/30
Tech Story Tags: nlp | amazon | aws | python | programming | email | backend | software-development

TLDR: We’re going to build this application using Python and its object-oriented features. It will make sure that anything you read (and highlight) gets presented back to you on a regular basis, so you never forget the material. The more you review something, the more it becomes a part of you; through spaced repetition, you can instill the notes in yourself. The app selects notes and highlights from your dataset and emails the selected notes to a specified email account.

Imagine reading something, and never losing track of that information.
Motivation: I usually find it difficult to remember what I read as time passes; memory retention drops off exponentially after the first few days (the classic forgetting curve). I try to take thorough notes and look them over regularly, but I usually need a trigger event to revisit them. That’s unsustainable, and I’m sure it’s the case for most people. Wouldn’t it be great if you could visit your highlights more regularly? The more you review something you’ve learned, the more it becomes a part of you.
I searched the internet for a passive way to re-read notes and found readwise.io — a service that emails you your highlights every day from various sources. Since I have been learning about Python’s object-oriented features and software architecture design patterns (and mostly forgetting them), I decided to put those skills to use and build a DIY version of the service for myself. Together, we’re going to build this application using Python and its object-oriented features. It will make sure that anything you read (and highlight) gets presented to you on a regular basis, so you never forget the material. Through spaced repetition, you can instill the notes in yourself.
Things this app does:
  1. Selects notes and highlights you’ve compiled from your dataset
  2. Sends an email with selected notes to a specified email account
  3. Emails on a user-defined schedule using Cron
Let’s get started
We’re going to need data. This is the most manual step of the entire process. I use PDF Expert to read PDFs, and it has a feature to export all annotations. I simply put these in an Excel document, which I then convert to JSON (using a generic Excel-to-JSON service on the internet). See the sample JSON file below; each block represents a highlight/note.
# JSON data
{
    "Sheet1": [
        {
            "date_added": "May 12, 8:59 AM, by Ankush Garg",
            "source": "Book",
            "title": "Fundamentals of Software Architecture",
            "chapter": "N/A",
            "note": "N/A",
            "highlight": "The microkernel architecture style is a relatively simple monolithic architecture consisting of two architecture components: a core system and plug-in components.",
            "page_number": "Page 165",
            "has_been_chosen_before": "0",
            "id": "48"
        },
        {
            "date_added": "Apr 12, 10:50 AM, by Ankush Garg",
            "source": "Book",
            "title": "Genetic Algorithms with Python",
            "chapter": "Chapter 4: Combinatorial Optimization - Search problems and combinatorial optimization",
            "note": "N/A",
            "highlight": "A search algorithm is focused on solving a problem through methodic evaluation of states and state transitions, aiming to find a path from the initial state to a desirable final (or goal) state. Typically, there is a cost or gain involved in every state transition, and the objective of the corresponding search algorithm is to find a path that minimizes the cost or maximizes the gain. Since the optimal path is one of many possible ones, this kind of search is related to combinatorial optimization, a topic that involves finding an optimal object from a finite, yet often extremely large, set of possible objects.",
            "page_number": "Page 109",
            "has_been_chosen_before": "0",
            "id": "21"
        }
    ]
}
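If you’d rather not depend on an online converter, a rough sketch like the one below can produce the same structure with pandas. This is only an illustration: notes.xlsx is a placeholder file name, and it assumes your sheet is called Sheet1 with column headers matching the keys in the sample above.
# Hypothetical Excel-to-JSON conversion with pandas (needs openpyxl for .xlsx files).
# notes.xlsx is a placeholder; columns are assumed to match the sample keys above.
import json
import pandas as pd

df = pd.read_excel('notes.xlsx', sheet_name='Sheet1', dtype=str)
data = {'Sheet1': df.to_dict(orient='records')}

with open('data/data.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)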
Folder structure: I’ll be using PyCharm to build this app. Let’s create the empty .py files in a project directory as laid out below. Feel free to put these files in any folder you prefer. The main idea is that each of these services relies on the others for its inputs/outputs: each one takes data, transforms it, and then does something with it.
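Since the screenshot doesn’t carry over here, this is roughly the layout I’m assuming (the folder name matches the paths used later in this post):
notes-email-sender/
├── data/
│   └── data.json
├── database.py
├── selector_service.py
├── parse_content.py
└── mail_service.py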
A very reasonable question at this point is why I decided to create four separate scripts for simply reading in the data, selecting some entries, and emailing them to a specified account. The reason is MODULARITY. I want each of these services to do exactly what it’s designed to do, and nothing more. In the future, if I want to swap functionality out, I can do that easily because there’s minimal dependency between the services. For example, database.py currently reads the data file locally, but as the dataset grows in volume it may pull data stored in S3. Without clear boundaries, accommodating that change would require an overhaul throughout the application; with separate, modular services that have minimal dependencies on one another, I can swap out big pieces of functionality at will.
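As a hedged sketch of what that swap could look like, an S3-backed database.py might be as small as the snippet below. The bucket and key names are placeholders, and it assumes boto3 is installed with AWS credentials configured. Because the function name and return shape stay the same, nothing else in the app would need to change.
# Hypothetical S3-backed replacement for database.py.
# Bucket and key are placeholders; assumes boto3 and configured AWS credentials.
import json
import boto3

BUCKET = 'my-highlights-bucket'
KEY = 'data/data.json'

def read_json_data():
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=BUCKET, Key=KEY)
    return json.loads(obj['Body'].read())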
Let’s walk through each of the service files:
1.
database.py
import json

# Ended up using http://beautifytools.com/excel-to-json-converter.php to convert Excel to JSON

# Path to the data file - local on my computer for now
url = '/Users/ankushgarg/Desktop/email-reading-highlights/notes-email-sender/data/data.json'

def read_json_data():
    with open(url) as json_file:
        response = json.load(json_file)
    return response
The database file is simple. It loads the locally stored data using the read_json_data function, and we now have access to the data in our application.
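A quick sanity check from a Python shell, assuming data.json matches the sample shown earlier:
# Load the file and peek at the first entry.
from database import read_json_data

data = read_json_data()
print(len(data['Sheet1']), 'highlights loaded')
print(data['Sheet1'][0]['title'])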
2.
selector_service.py
# This script reads in the data (via database.py) and selects highlights to send
import numpy as np
from database import read_json_data


def increment_has_chosen_before(item):
    count_now = int(item['has_been_chosen_before'])
    item['has_been_chosen_before'] = count_now + 1


class SelectorService:
    def __init__(self):
        self.raw_response = read_json_data() # Read in JSON data
        self.sampled_object = None
        self.sheet_name_to_sample_by = 'Sheet1'
        self.num_of_entries_to_sample = 3 # Number of entries to select

    def select_random_entries(self):
        # Randomly choose entries from the dataset (np.random.choice samples with replacement by default)
        self.sampled_object = np.random.choice(self.raw_response[self.sheet_name_to_sample_by],
                                               self.num_of_entries_to_sample)

        # For each selection increment the field "has_been_chosen_before"
        # In the future, probabilities could be used to favor notes that haven't been selected as often
        for note in self.sampled_object:
            increment_has_chosen_before(note)
        return self.sampled_object
SelectorService has an attribute that relies on read_json_data, as we saw above, and self.raw_response holds the returned response. Three entries are selected at random in select_random_entries and stored in self.sampled_object.
We have sampled the entries now and are ready to parse that content.
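If you want to try the selector on its own before wiring up the rest, something like this should work:
# Pick three entries and print where each one came from.
from selector_service import SelectorService

selector = SelectorService()
for entry in selector.select_random_entries():
    print(entry['title'], '-', entry['page_number'])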
3.
parse_content.py
from selector_service import SelectorService


class ContentParser:
    def __init__(self):
        self.sample_entries = SelectorService().select_random_entries()
        self.content = None

    def parse_selected_entries(self):
        content = ''
        for entry in self.sample_entries:
            content += "DATE-ADDED: " + entry['date_added'] + "\n"
            content += "HIGHLIGHT: " + entry['highlight'] + "\n"
            content += "TITLE: " + entry['title'] + "\n"
            content += "CHAPTER: " + entry['chapter'] + "\n"
            content += "SOURCE: " + entry['source'] + "\n"
            content += "PAGE-NUMBER: " + entry['page_number'] + "\n" + "------------" + "\n"
        self.content = content
        return self.content
The ContentParser class takes in the random entries, stores them as the class attribute self.sample_entries, and formats them for emailing with the parse_selected_entries method. parse_selected_entries simply formats the content for the email to be sent out in the next step; text formatting is all that’s happening. The parsed content can now be emailed.
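To preview the formatted text before involving email at all, a two-liner like this should do:
# Print the email body to the console instead of sending it.
from parse_content import ContentParser

print(ContentParser().parse_selected_entries())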
4.
mail_service.py
# This service emails whatever it gets back from Content Parser
from parse_content import ContentParser
import smtplib
from email.message import EmailMessage


class MailerService:
    def __init__(self):
        self.msg = EmailMessage()
        self.content = ContentParser().parse_selected_entries()

    def define_email_parameters(self):
        self.msg['Subject'] = 'Your Highlights and Notes for today'
        self.msg['From'] = "example@gmail.com" # your email
        self.msg['To'] = ["example@gmail.com"] # recipient email

    def send_email(self):
        self.msg.set_content(self.content)
        with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
            smtp.login("example@gmail.com", 'password') # email account used for sending the email
            smtp.send_message(self.msg)
        return True

    def run_mailer(self):
        self.define_email_parameters()
        self.send_email()


def run_job():
    composed_email = MailerService()
    composed_email.run_mailer()


if __name__ == '__main__':
    run_job()
MailerService takes the parsed content from ContentParser and stores it as the self.content class attribute. define_email_parameters sets the email parameters such as the subject, sender, and recipient, and the send_email method sends the message. Both methods are triggered by run_mailer, and the entire application is kicked off by the run_job function at the very bottom, which sends the email to the specified account. This is what the email looks like.
Sample Email
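One caveat about the code above: it hardcodes the Gmail password in the script. A small variation, sketched below, reads the credentials from environment variables instead; EMAIL_ADDRESS and EMAIL_PASSWORD are placeholder names that you would export in your shell and in your crontab.
# Hypothetical variation of the send step that avoids hardcoded credentials.
# Assumes EMAIL_ADDRESS and EMAIL_PASSWORD are set in the environment.
import os
import smtplib
from email.message import EmailMessage

def send_highlights(content):
    msg = EmailMessage()
    msg['Subject'] = 'Your Highlights and Notes for today'
    msg['From'] = os.environ['EMAIL_ADDRESS']
    msg['To'] = os.environ['EMAIL_ADDRESS']
    msg.set_content(content)
    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
        smtp.login(os.environ['EMAIL_ADDRESS'], os.environ['EMAIL_PASSWORD'])
        smtp.send_message(msg)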
Congrats, you’ve made it this far! One last thing is to run mail_service.py on a schedule. Let’s use Crontab for that. Cron is a long-running process that executes commands at specific dates and times, and it can be used to schedule recurring tasks.
In your Crontab, add the following code with your absolute paths:
0 19 * * * /Users/ankushgarg/.pyenv/shims/python /Users/ankushgarg/Desktop/email-reading-highlights/notes-email-sender/mail_service.py >> /Users/ankushgarg/Desktop/email-reading-highlights/notes-email-sender/cron.log 2>&1
This entry runs every day at 7 PM in your machine’s local time zone (the five fields are minute, hour, day of month, month, and day of week). Check out https://crontab.guru/ for help coming up with a schedule in Cron format.
You’re done! My call to action is for you to make it better. Some ideas to enhance this project and make it yours:
  1. Data preparation is mostly manual at the moment. You can automate that by parsing PDFs using Python.
  2. The email that goes out isn’t pretty at the moment. You can use HTML to make the content look better.
  3. Use the has_been_chosen_before attribute to make the selection smarter. Currently the sampling happens randomly and with replacement; you can change it so that has_been_chosen_before probabilistically informs which highlight to include next (see the sketch after this list).
  4. Store your data on S3 and see if you can make it work. It’s a great exercise if you haven’t used S3 or any AWS service yet.
  5. Involve friends and send each other your highlights.
  6. Use NLP to parse the text and come up with context, a summary, or sentiment for each highlight, and include that in the dataset.
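For idea 3, here’s a rough sketch of what probability-weighted selection could look like. The weighting scheme and function name are just illustrative, and it assumes the same data shape as above.
# Hypothetical weighted selector: entries picked less often get a higher probability.
import numpy as np
from database import read_json_data

def select_weighted_entries(num_entries=3):
    entries = read_json_data()['Sheet1']
    counts = np.array([int(entry['has_been_chosen_before']) for entry in entries])
    weights = 1.0 / (1.0 + counts)              # less-seen entries weigh the most
    probabilities = weights / weights.sum()     # normalize into a distribution
    chosen = np.random.choice(entries, size=num_entries, replace=False, p=probabilities)
    return list(chosen)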
Once you have the structure down, there’s so much you can do. If you do decide to enhance this app, reach out and let me know so I can get some ideas for improvement as well. If anything is unclear, let me know and I’d be happy to clarify. 
Cheers!

Published by HackerNoon on 2020/07/30