disco-reaper/BACKUP.md
2026-03-05 11:47:24 +05:30

Discord Reaper: Backup System Technical Specification

This document provides a deep-dive into the architecture, data lifecycle, and resilience strategies of the Discord Reaper backup system.

1. Architectural Overview

The backup system is built on a decoupled architecture that separates the API communication layer from the business logic and I/O operations.

  • DiscordReader (API Provider): A high-level wrapper around the discord.py library. It handles authentication, rate limiting, and provides an asynchronous interface for fetching guild data, message history, and binary assets. It focuses on fetching rather than processing.
  • DiscordExporter (Orchestration & Serialization): The core engine that defines the export lifecycle. It consumes data from the Reader, transforms it into standardized schemas, and manages local filesystem operations.

Component Interaction Diagram

graph TD
    A[UI / CLI] --> B[DiscordExporter]
    B --> C[DiscordReader]
    C --> D[Discord API]
    B --> E[Local Filesystem]
    B --> F[User Cache Object]

File Tree Structure

DISCORD_BACKUP-{ServerID}/
├── server_profile/
│   ├── profile.json           # Server metadata (name, ID, icon/banner paths)
│   ├── roles.json             # All server roles (permissions, colors, positions)
│   ├── structure.json         # Full category and channel hierarchy
│   ├── assets.json            # Index of custom emojis and stickers
│   └── assets/                # Binary media files
│       ├── server_icon.png
│       ├── server_banner.png
│       ├── emoji_{name}_{id}.png
│       └── sticker_{name}_{id}.png
└── message_backup/
    ├── users/
    │   ├── user_info.json     # Deduplicated user profile cache
    │   └── avatars/           # User avatar images
    │       └── {user_id}.png
    └── {channel_id}/
        ├── messages.json      # Channel message history + metadata
        ├── attachments/       # Channel-level attachments
        │   └── {filename}-{id_last_5}.{ext}
        └── {thread_id}/       # Thread nested inside parent channel
            ├── thread_messages.json
            └── thread_attachments/
                └── {filename}-{id_last_5}.{ext}

2. Data Lifecycle & Serialization

2.1 Incremental Synchronization Algorithm

To achieve idempotency and efficiency, the system implements an incremental sync strategy using Discord's snowflake IDs.

  1. State Loading: The Exporter reads the existing {channel_id}/messages.json (if present).
  2. Snowflake Extraction: It extracts the lastMessageID from the metadata.
  3. Filtered Fetch: It calls fetch_message_history(after_id=last_id).
  4. In-Memory Merge: New messages are appended to the existing list.
  5. Atomic Write: The updated JSON, including the new lastMessageID, is written back to disk atomically. Because the stored ID advances with every run, the next run fetches only the new delta from the API.
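The five steps above can be sketched roughly as follows. This is a minimal illustration rather than the Exporter's actual code: the helper names (load_state, merge_messages, write_atomic) are hypothetical, and the temp-file-plus-rename pattern is only one common way to make the write atomic.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the existing channel file, or an empty state on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"lastMessageID": None, "messageCount": 0, "messages": []}


def merge_messages(state: dict, new_messages: list) -> dict:
    """Append messages newer than lastMessageID and refresh the metadata."""
    last_id = int(state["lastMessageID"] or 0)
    # Snowflakes grow over time, so a numeric comparison skips stored messages.
    fresh = sorted(
        (m for m in new_messages if int(m["messageID"]) > last_id),
        key=lambda m: int(m["messageID"]),
    )
    state["messages"].extend(fresh)
    if fresh:
        state["lastMessageID"] = fresh[-1]["messageID"]
    state["messageCount"] = len(state["messages"])
    return state


def write_atomic(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # readers never observe a half-written file
```

Running the merge twice with overlapping input leaves the file unchanged the second time, which is the idempotency property the section describes.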

2.2 User Profile Deduplication (user_info.json)

The system avoids redundant storage of user metadata (usernames, roles, colors) by using a global user_cache map.

  • Key: userID (Snowflake).
  • Policy: Users are added to the cache only on their first appearance in any channel's history.
  • Avatar Persistence: User avatars are stored in message_backup/users/avatars/ and referenced by relative paths in the JSON schemas.
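A first-appearance policy like this one might look as follows. The function names are illustrative assumptions; only the userID key and the array layout of user_info.json come from this document.

```python
def remember_user(user_cache: dict, user: dict) -> bool:
    """Add a user on first appearance; return True only if newly added."""
    user_id = str(user["userID"])
    if user_id in user_cache:
        return False  # already cached from an earlier message or channel
    user_cache[user_id] = user
    return True


def to_user_info(user_cache: dict) -> list:
    """Flatten the cache into the array stored in user_info.json."""
    return list(user_cache.values())
```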

3. Special Channel Type Specifications

3.1 Forum Channels & Threads

Forums present a hierarchical challenge where the "starter message" and the "conversation" exist in separate contexts.

  • Forum Index ({forum_id}/messages.json): Contains an enriched list of "starter messages" representing each thread. These entries include thread titles, applied tags, and total attachment stats (summed from the entire thread).
  • Thread Persistence: All threads nest inside their parent channel directory:
    • Forum Threads: message_backup/{forum_id}/{thread_id}/thread_messages.json
    • Regular Threads: message_backup/{parent_channel_id}/{thread_id}/thread_messages.json
  • Starter Identification: The system uses thread.history(limit=1, after=snowflake(thread_id - 1)) to reliably capture the first post even if it has been edited or pinned.
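The starter lookup can be sketched as below. The snippet assumes only that the thread object exposes an id and an async history() iterator accepting limit, after, and oldest_first, as discord.py threads do. SnowflakeRef is a hypothetical stand-in used to keep the example free of third-party imports; real code would more likely pass discord.Object(id=thread.id - 1).

```python
from dataclasses import dataclass


@dataclass
class SnowflakeRef:
    """Stand-in for discord.Object: any object with an integer `id`."""
    id: int


async def fetch_starter(thread):
    """Return the oldest message whose ID is >= thread.id, or None if empty."""
    # Subtracting 1 makes the `after` bound inclusive of the thread's own ID.
    after = SnowflakeRef(thread.id - 1)
    async for message in thread.history(limit=1, after=after, oldest_first=True):
        return message
    return None
```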

4. Resilience & Error Handling

4.1 Permission Resilience (403 Forbidden)

The system is designed to "fail-soft" when encountering restricted content:

  • Server Level: If the bot lacks view_channel or read_message_history globally, the backup aborts with a clear error.
  • Channel Level: If a specific channel is restricted, the error is logged, and the system proceeds to the next channel to ensure a partial backup is still completed.
  • Asset Level: If an emoji or sticker cannot be downloaded due to permissions, the metadata is preserved with a null local path.
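The channel-level behaviour can be illustrated with a small loop. ForbiddenError and export_channel are hypothetical names standing in for whatever 403 exception and export routine the real code uses.

```python
import logging

log = logging.getLogger("backup")


class ForbiddenError(Exception):
    """Stand-in for the HTTP 403 error raised by the API layer."""


async def backup_channels(channel_ids, export_channel):
    """Export each channel; log and skip the ones the bot cannot read."""
    done, skipped = [], []
    for channel_id in channel_ids:
        try:
            await export_channel(channel_id)
        except ForbiddenError as exc:
            # Fail soft: record the problem and continue with the next channel.
            log.warning("Skipping channel %s: %s", channel_id, exc)
            skipped.append(channel_id)
        else:
            done.append(channel_id)
    return done, skipped
```

Returning the skipped IDs alongside the completed ones lets a caller report exactly which parts of the backup are missing.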

4.2 Lottie Sticker Workaround

Discord's Lottie stickers (format: 3) are not supported by standard discord.py save methods. The system implements a bypass:

  1. Extracts the internal aiohttp session from the client: client.http._HTTPClient__session.
  2. Performs a direct GET request to the sticker URL.
  3. Streams the raw byte data directly to a .json file locally.
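Steps 2 and 3 could be written along these lines. The sketch assumes the session behaves like an aiohttp.ClientSession (get() used as an async context manager, response.content.iter_chunked() for streaming). The function name and chunk size are illustrative, and obtaining the session from the private client attribute is left to the caller.

```python
from pathlib import Path


async def save_lottie(session, url: str, dest: Path, chunk_size: int = 65536) -> int:
    """Stream a sticker's raw bytes to `dest` and return the number written."""
    written = 0
    async with session.get(url) as response:
        response.raise_for_status()  # surface 403/404 instead of saving an error body
        dest.parent.mkdir(parents=True, exist_ok=True)
        with dest.open("wb") as handle:
            async for chunk in response.content.iter_chunked(chunk_size):
                handle.write(chunk)
                written += len(chunk)
    return written
```

Because the session is reached through a name-mangled private attribute, this approach may break if the library changes its internals.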

5. Technical Schemas

5.1 Message Object (_format_message)

The internal representation of a message focuses on portability:

| Field | Type | Description |
|---|---|---|
| messageID | String | Original Discord Snowflake |
| type | String | Normalized type (Text, ThreadStarter, Forward, etc.) |
| timestamp | ISO8601 | Created date/time |
| isPinned | Boolean | Pin status |
| content | String | Raw markdown content (or snapshot content for forwards) |
| userID | String | Reference to user_info.json |
| attachments | Array | List of local file references and metadata |
| embeds | Array | Raw Discord-formatted embed objects |
| stickers | Array | List of Message Sticker objects (see below) |
| reactions | Array | List of Reaction objects |

Message Sticker Object

| Field | Type | Description |
|---|---|---|
| id | String | Sticker Snowflake ID |
| name | String | Sticker name |
| format | String | File format (PNG, APNG, LOTTIE, GIF) |
| localPath | String | Relative path to local file in {channel_id}/attachments/ |

Reaction Object

| Field | Type | Description |
|---|---|---|
| emoji | String | String representation (unicode or name:id) |
| count | Integer | Total count of this reaction |

5.2 Asset Naming Logic

To prevent filename collisions (e.g., multiple files named image.png), the system uses a suffixing strategy: {filename_stem}-{snowflake_last_5}.{ext}

Example: sunset-54321.png
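A possible implementation of this suffixing rule is shown below. The function name is hypothetical, and the handling of extensionless files is an assumption the document does not specify.

```python
from pathlib import PurePosixPath


def local_asset_name(filename: str, snowflake: str) -> str:
    """Build {filename_stem}-{snowflake_last_5}.{ext} from an upload name and ID."""
    name = PurePosixPath(filename)
    # `suffix` already includes the leading dot and is empty when there is none.
    return f"{name.stem}-{str(snowflake)[-5:]}{name.suffix}"
```

Two uploads both named image.png get distinct local names as long as the last five digits of their IDs differ.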

5.3 profile.json Specification

Path: server_profile/profile.json

| Field | Type | Description |
|---|---|---|
| name | String | Original Discord guild name |
| id | String | Guild Snowflake ID |
| icon | String | Relative path to local guild icon in server_profile/assets/ |
| banner | String | Relative path to local guild banner in server_profile/assets/ |
| last_backup | ISO8601 | Timestamp of the last successful backup run |
| ignore_channels | Array | List of channel Snowflakes explicitly excluded from backup |

5.4 roles.json Specification (Array of objects)

Path: server_profile/roles.json

| Field | Type | Description |
|---|---|---|
| id | String | Role Snowflake ID |
| name | String | Role name |
| color | String | Hex-string representation of role color (e.g. "#ffffff") |
| position | Integer | Vertical position in the hierarchy (0 is bottom) |
| permissions | Integer | Bitwise integer representing the role's Discord permissions |
| hoist | Boolean | Whether the role is displayed separately in the sidebar |
| mentionable | Boolean | Whether the role can be mentioned |

5.5 assets.json Specification

Path: server_profile/assets.json

This file contains two primary arrays: emojis and stickers.

Emoji Object

| Field | Type | Description |
|---|---|---|
| id | String | Emoji Snowflake ID |
| name | String | Emoji name (without colons) |
| animated | Boolean | True if the emoji is a GIF |
| filename | String | Filename within server_profile/assets/ |

Sticker Object

| Field | Type | Description |
|---|---|---|
| id | String | Sticker Snowflake ID |
| name | String | Sticker name |
| filename | String | Filename within server_profile/assets/ |

5.6 structure.json Specification (Array of Category objects)

Path: server_profile/structure.json

Category Object

| Field | Type | Description |
|---|---|---|
| type | String | Always "category" |
| id | String | Category Snowflake ID (or "uncategorized") |
| name | String | Category name |
| position | Integer | Vertical position in hierarchy |
| channels | Array | List of Channel objects (see below) |

Channel Object

| Field | Type | Description |
|---|---|---|
| id | String | Channel Snowflake ID |
| name | String | Channel name |
| type | String | "text", "voice", "forum", "news", or "thread" |
| position | Integer | Vertical position within the category |
| topic | String | Channel description/topic (null if empty) |
| nsfw | Boolean | True if marked Restricted/NSFW |
| available_tags | Array | List of Forum Tag objects (see below) |

Forum Tag Object

| Field | Type | Description |
|---|---|---|
| id | String | Tag Snowflake ID |
| name | String | Tag display name |
| moderated | Boolean | True if restricted to moderators |
| emoji_id | String | ID of the tag's emoji (null if unicode/none) |
| emoji_name | String | Name of the tag's emoji |

5.7 user_info.json Specification (Array of User objects)

Path: message_backup/users/user_info.json

| Field | Type | Description |
|---|---|---|
| userID | String | User Snowflake ID |
| username | String | Current global username |
| userNickname | String | Server-specific nickname (display name) |
| userColor | String | Role-derived color for the user |
| userIsBot | Boolean | True if the account is a bot |
| userRoles | Array | List of role snippets (name, id, color, position) |
| userAvatar | String | Relative path to local avatar in users/avatars/ |

5.8 Channel History JSON Specification

Path: message_backup/{channel_id}/messages.json

This file contains the full history of a channel along with synchronization metadata.

| Field | Type | Description |
|---|---|---|
| channelName | String | Human-readable name of the channel |
| channelID | String | Channel Snowflake ID |
| channelType | String | "Text", "Thread", "News", or "Forum" |
| messageCount | Integer | Total number of messages stored in the messages array |
| threadCount | Integer | (If Parent) Count of threads associated with this channel |
| lastMessageID | String | ID of the most recent message (used for incremental sync) |
| totalAttachmentSizeBytes | Integer | Summed size of all attachments for this channel |
| numberOfAttachments | Integer | Total count of attachments |
| lastBackup | ISO8601 | Timestamp of last message fetch |
| messages | Array | The message objects (see Section 5.1) |
| parentID | String | (If Thread) Snowflake of the parent channel |

5.9 Thread History JSON Specification

Path: message_backup/{channel_id}/{thread_id}/thread_messages.json

Same schema as Section 5.8, with channelType set to "Thread" and parentID always present.


7. Backup Reader Implementation Guide

This section is a technical manual for developers building third-party tools (viewers, search engines, or analytics) to consume Discord Reaper backups.

7.1 Entry Point Discovery

A reader should start by identifying the backup root directory (prefixed with DISCORD_BACKUP-).

  1. Parse server_profile/profile.json: Extract the server name, ID, and assets (icon/banner).
  2. Load server_profile/structure.json: This defines the navigation tree for your UI.
    • Iterate through categories.
    • Map channels to their respective types (text, voice, forum).
    • Store the position to preserve the original visual order.
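A reader's startup code might resemble the sketch below. The function names are illustrative; the file paths and the position field are taken from the schemas in Section 5.

```python
import json
from pathlib import Path


def find_backup_roots(parent: Path) -> list:
    """List directories under `parent` that look like backup roots."""
    return sorted(
        p for p in parent.iterdir()
        if p.is_dir() and p.name.startswith("DISCORD_BACKUP-")
    )


def load_navigation(root: Path):
    """Return (profile, tree), where tree is [(category_name, channels), ...]."""
    profile_dir = root / "server_profile"
    profile = json.loads((profile_dir / "profile.json").read_text(encoding="utf-8"))
    structure = json.loads((profile_dir / "structure.json").read_text(encoding="utf-8"))
    tree = []
    # Sorting by position restores the original visual order.
    for category in sorted(structure, key=lambda c: c["position"]):
        channels = sorted(category["channels"], key=lambda ch: ch["position"])
        tree.append((category["name"], channels))
    return profile, tree
```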

7.2 Relational Data Mapping

The backup data is normalized to minimize duplication. A reader must implement the following resolve logic:

  • User Resolution: When parsing a message in {channel_id}/messages.json, the userID must be cross-referenced against the userID keys in message_backup/users/user_info.json.
  • Role Resolution: Take the id of each entry in the user object's userRoles array and resolve it against the role metadata in server_profile/roles.json to get current colors and names.
  • Static Asset Resolution:
    • Server Assets: Prepend server_profile/assets/ to filenames found in server_profile/assets.json.
    • User Avatars: Resolve userAvatar paths found in user_info.json (pointing to users/avatars/).
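The user and role lookups above amount to two dictionary joins. The helper names below are hypothetical; the field names (userID, userRoles, id) follow Sections 5.4 and 5.7.

```python
def index_by(items: list, key: str) -> dict:
    """Build a lookup table keyed by the string form of `key`."""
    return {str(item[key]): item for item in items}


def resolve_author(message: dict, users_by_id: dict, roles_by_id: dict):
    """Return the author's user record plus full role records, or None if unknown."""
    user = users_by_id.get(str(message["userID"]))
    if user is None:
        return None
    # Keep only roles that still exist in roles.json.
    roles = [
        roles_by_id[str(snippet["id"])]
        for snippet in user.get("userRoles", [])
        if str(snippet["id"]) in roles_by_id
    ]
    return {"user": user, "roles": roles}
```

Building both indexes once at load time avoids rescanning user_info.json for every message.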

7.3 Message Rendering Logic

When rendering the messages array from a channel JSON:

| Feature | Reader Implementation Logic |
|---|---|
| Markdown | Content is raw Discord markdown. Use a library like markdown-it with discord-specific plugins. |
| Attachments | Resolve url field ({channel_id}/attachments/{filename}) relative to the message_backup/ directory. |
| Emojis/Stickers | If a message contains custom emojis/stickers, resolve their metadata via server_profile/assets.json. |
| Replies | Use the reference object to find the target messageId. Note: The target might be in the same file or a different channel/thread. |

7.4 Thread & Forum Reconstruction

Reconstructing the hierarchy requires specific pointer logic:

  1. Forums:
    • Read message_backup/{forum_id}/messages.json.
    • Each message in this file is a thread starter message (type ThreadStarter).
    • The messageID of the starter message is usually the same as the thread_id.
    • To load the full thread, open message_backup/{forum_id}/{thread_id}/thread_messages.json.
  2. Regular Threads:
    • Discoverable via the parentID field in a thread's thread_messages.json, or by scanning for thread_messages.json inside channel directories.
    • Match the thread.id in a ThreadStarter message to the respective subdirectory.
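The pointer logic above could be sketched as follows. All function names are illustrative. Since the starter's messageID is only usually equal to the thread ID, the directory scan serves as a fallback for threads the forum index does not match.

```python
import json
from pathlib import Path


def thread_file(backup_root, parent_id, thread_id) -> Path:
    """Path of a thread's history file, nested under its parent channel."""
    return (Path(backup_root) / "message_backup" / str(parent_id)
            / str(thread_id) / "thread_messages.json")


def discover_threads(backup_root, parent_id) -> list:
    """Find thread IDs by scanning for thread_messages.json in subdirectories."""
    parent_dir = Path(backup_root) / "message_backup" / str(parent_id)
    return sorted(p.parent.name for p in parent_dir.glob("*/thread_messages.json"))


def load_forum_threads(backup_root, forum_id) -> dict:
    """Map starter messageID -> thread history for every starter with a thread file."""
    index_path = Path(backup_root) / "message_backup" / str(forum_id) / "messages.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    threads = {}
    for starter in index["messages"]:
        path = thread_file(backup_root, forum_id, starter["messageID"])
        if path.exists():
            threads[starter["messageID"]] = json.loads(path.read_text(encoding="utf-8"))
    return threads
```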

8. Discord.py Model Hydration Guide

If you are building a discord.py API-compatible wrapper to read these backups directly into familiar Discord objects, here is the explicit property mapping from the schema to the standard discord.py object attributes.

8.1 Base Server (Guild)

Files: server_profile/profile.json, server_profile/roles.json, and server_profile/structure.json

  • discord.Guild:
    • id: Cast id (str) to int.
    • name: Mapped directly from name.
    • icon / banner: Represented as discord.Asset objects. Use the local file paths from icon / banner as the asset URL/filepath.
    • roles: Hydrated from server_profile/roles.json.
    • channels / categories: Hydrated from server_profile/structure.json.

8.2 Roles (discord.Role)

File: server_profile/roles.json

  • id: Cast id to int.
  • name: Mapped directly.
  • color: Parse the hex string to discord.Color(value).
  • position: Mapped directly.
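  • permissions: Initialize discord.Permissions(int(permissions)). Pass the integer positionally; keyword arguments to discord.Permissions are interpreted as individual permission flag names.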
  • hoist: Mapped directly to boolean.
  • mentionable: Mapped directly to boolean.
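As a library-free illustration of the conversions in this list, the sketch below parses a roles.json entry into a plain dataclass. RoleRecord and both function names are hypothetical. The resulting integers are the raw values that discord.Color and discord.Permissions are built from, each passed as the first positional argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleRecord:
    id: int
    name: str
    color_value: int        # raw RGB integer
    position: int
    permissions_value: int  # raw permission bitfield
    hoist: bool
    mentionable: bool


def parse_hex_color(text: str) -> int:
    """Convert a string such as "#ffffff" to its integer RGB value."""
    return int(text.lstrip("#"), 16)


def role_from_json(raw: dict) -> RoleRecord:
    """Apply the casts listed above to one roles.json entry."""
    return RoleRecord(
        id=int(raw["id"]),
        name=raw["name"],
        color_value=parse_hex_color(raw["color"]),
        position=raw["position"],
        permissions_value=int(raw["permissions"]),
        hoist=bool(raw["hoist"]),
        mentionable=bool(raw["mentionable"]),
    )
```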

8.3 Users & Members (discord.Member / discord.User)

File: message_backup/users/user_info.json

  • id: Cast userID to int.
  • name: Mapped from username.
  • display_name: Mapped from userNickname.
  • bot: Mapped from userIsBot.
  • color: Parse userColor string to discord.Color.
  • roles: List of hydrated discord.Role objects via matching ids from the userRoles array.
  • avatar: Mocked discord.Asset using the userAvatar local path.

8.4 Channels (discord.TextChannel, discord.CategoryChannel, discord.ForumChannel)

File: server_profile/structure.json

  • Iterate over the top-level array (Categories):
    • discord.CategoryChannel:
      • id: Cast id to int.
      • name: Mapped directly.
      • position: Mapped directly.
  • Iterate over the nested channels array:
    • discord.abc.GuildChannel classes:
      • id: Cast id to int.
      • name: Mapped directly.
      • position: Mapped directly.
      • type: Match the type string back to the discord.ChannelType enum.
      • category_id: Inherited from the parent category block.
      • topic: Mapped directly (if applicable).
      • nsfw: Mapped directly to boolean.

8.5 Messages (discord.Message)

File: message_backup/{channel_id}/messages.json (Iterating the messages array)

  • id: Cast messageID to int.
  • type: Map the normalized type string (e.g., "Text", "ThreadStarter"; see Section 5.1) to the corresponding discord.MessageType member.
  • created_at: Parse timestamp (ISO-8601 string) into a timezone-aware datetime object.
  • pinned: Mapped from isPinned.
  • content: Mapped from content.
  • author: Resolve the userID against the loaded discord.Member mocks.
  • embeds: Instantiate using discord.Embed.from_dict(embed_dict) directly on the elements of the embeds array.
  • Reference (Replies):
    • If reference exists, hydrate a discord.MessageReference.
    • message_id: Cast reference.messageId to int.
    • channel_id: Cast reference.channelId to int.
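The timestamp and reference conversions can be done with the standard library alone. The helper names are hypothetical, and treating an offset-less timestamp as UTC is an assumption rather than something the schema states.

```python
from datetime import datetime, timezone


def parse_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 string into a timezone-aware datetime."""
    # Older Python versions reject a trailing "Z", so normalize it first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


def parse_reference(message: dict):
    """Return (message_id, channel_id) as ints, or None when not a reply."""
    reference = message.get("reference")
    if not reference:
        return None
    return int(reference["messageId"]), int(reference["channelId"])
```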

8.6 Attachments (discord.Attachment)

Nested within Message objects.

  • id: Cast id to int.
  • filename: Mapped from fileName.
  • size: Mapped from fileSizeBytes.
  • url / proxy_url: Point to the local relative path ({channel_id}/attachments/{resolved_filename}).

8.7 Reactions (discord.Reaction & discord.PartialEmoji)

Nested within Message objects.

  • count: Mapped from count.
  • emoji: Inspect the emoji string. If it is custom (contains a :), split it into name and ID to build a discord.PartialEmoji(name=..., id=...). Otherwise, keep the unicode string as-is.
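Splitting the stored emoji string might look like this. The function name is illustrative, and the snippet assumes custom emojis are stored exactly as name:id, as described in Section 5.1. It returns a plain tuple whose parts can be passed to discord.PartialEmoji(name=..., id=...).

```python
def parse_emoji(text: str):
    """Return (name, id) for a custom emoji, or (text, None) for unicode."""
    if ":" in text:
        name, _, raw_id = text.strip(":").rpartition(":")
        if name and raw_id.isdigit():
            return name, int(raw_id)
    # Anything that is not name:id is treated as a plain unicode emoji.
    return text, None
```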