Posts 127 · Comments 598 · Joined 2 yr. ago

  • A core design goal of either approach is that server operators can adjust how this data is built without needing to modify or restart the lemmy_server Rust code.

    Using a smallint also gives some flexibility (or a new field, if going with the id min/max approach): if the requested page is greater than 10 for a particular sort, switch to include > 1 and fall into tiers.

  • An even less-intrusive approach is to not add any new field to existing tables. Instead, establish a reference table, say include_range. There is already an ENUM value for each sort type, so the include_range table would have these columns: sort_type ENUM, lowest_id BigInt, highest_id BigInt.
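
    A minimal sketch of what that reference table could look like (the enum type name sort_type_enum is an assumption here, not necessarily what the Lemmy schema calls it):

    CREATE TABLE include_range (
        sort_type  sort_type_enum NOT NULL,  -- one row per existing sort type (assumed enum name)
        lowest_id  bigint NOT NULL,          -- smallest post_aggregates.id to include for that sort
        highest_id bigint NOT NULL,          -- largest post_aggregates.id to include for that sort
        PRIMARY KEY (sort_type)
    );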

    Run a variation of this query to populate that table:

    SELECT MIN(ranked_recency.id) AS lowest_id,
           MAX(ranked_recency.id) AS highest_id
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;

    Run that against every sort order, including Old. Capture only two BigInt results: the MIN(id) and the MAX(id); that gives a range over the whole table. Then every SELECT on the post_aggregates / post table includes a WHERE id >= lowest_id AND id <= highest_id.

    That would put in a basic sanity check that ages out content, and it would filter right against the primary key!
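
    As an illustration of that WHERE clause (a sketch only; the sort value 'New' and the join shape are placeholders, not the existing ORM output):

    -- Constrain the scan to the pre-computed id range for the requested sort,
    -- then apply the normal filters, ORDER BY, and LIMIT on a much smaller set.
    SELECT pa.*
    FROM post_aggregates pa
    JOIN include_range ir ON ir.sort_type = 'New'
    WHERE pa.id >= ir.lowest_id
      AND pa.id <= ir.highest_id
    ORDER BY pa.published DESC
    LIMIT 20;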

  • Good results with this approach. I hadn't considered RANK() OVER (PARTITION BY criteria_a) and it works like a champ. It moves the ORDER BY onto the column of focus (criteria_a), performance seems decent enough... and the short statement isn't difficult to read.

     
        
    SELECT COUNT(ranked_recency.*) AS post_row_count
    FROM
      (
         SELECT id, post_id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;
    
      

    Gives me the expected results over the 5+ million test rows I ran it against.

    If you could elaborate on your idea of TOP, please do. I'm hoping there might be a way to push the LIMIT 1000 down into the inner query so the outer query doesn't have to WHERE-filter on rank across so many results?
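
    For what it's worth, one shape that might push the limit inward (a sketch only, not benchmarked; it assumes a community table with primary key id and an index on post_aggregates (community_id, published DESC, id DESC)):

    -- Per-community top-N via LATERAL: the LIMIT applies inside the subquery,
    -- so the outer query never sees more than 1000 rows per community.
    SELECT COUNT(recent.*) AS post_row_count
    FROM community c
    CROSS JOIN LATERAL (
       SELECT pa.id, pa.community_id, pa.published
       FROM post_aggregates pa
       WHERE pa.community_id = c.id
       ORDER BY pa.published DESC, pa.id DESC
       LIMIT 1000
    ) recent;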

  • What problem are you trying to solve?

    Reproducible, regular server crashes from queries taking tens of seconds, because the whole logic is based on WHERE clauses with no real selectivity. The server overloads in the field have been going on every single day that I've been here testing the big servers, since May 2023.

    If I want to look through 1000 posts in a community, I probably want to look at more than 1000.

    I'm well aware of the pushback. Everyone chimes in saying they want counting to be real time, and the developers seem to avoid caching at all costs. Out of desperation, I'm trying to build some kind of basic sanity logic into the system so it doesn't plow through 5 million rows to do a LIMIT 10 query.

    Right now Lemmy works perfectly fine with no personalization. For anonymous users it works great. If you want to read a million posts, it works great. Start blocking specific users, adding NSFW filters, cherry-picking a blend of communities, etc., and the problems show up. The ORM logic is difficult to follow, based on a massive JOIN of every field in many tables, and at certain data thresholds, with per-account preferences engaged, it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page = 1 of a LIMIT 20 query, even for a single community).

    The programmers who built the code for over 4 years don't seem to think it is an urgent problem. So I'm chipping in. I personally have never worked with this ORM and I find it painful compared to the hand-crafted SQL I've done on major projects. I'm doing this because I feel like nobody else has for months.

  • OK, experimenting on a massive test data set of over 5 million posts... this PostgreSQL query works pretty well:

     
        
    SELECT COUNT(ranked_recency.*) AS post_row_count
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;
    
      

    This limits any one community to 1000 posts, picking the most recently created ones. It gives a way to age out older data in very active communities without removing any posts at all from small communities.
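
    For a sense of how that window could plug into an actual listing rather than a count (sketch only; it assumes post_aggregates.post_id references post.id, as in the query above that selects post_id, and that post has a published column):

    -- Keep only the newest 1000 posts per community, then list a page of them.
    SELECT p.*
    FROM post AS p
    JOIN (
       SELECT post_id,
          rank() OVER (
             PARTITION BY community_id
             ORDER BY published DESC, id DESC
             ) AS recency_rank
       FROM post_aggregates
    ) ranked_recency ON ranked_recency.post_id = p.id
    WHERE ranked_recency.recency_rank <= 1000
    ORDER BY p.published DESC
    LIMIT 20;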

  • That is, give a “next page” token

    There's already a pull request on changing paging.

    My focus is a very hard wall on performance and scale. As things stand, there is way too much potential for queries to run into the full post table.

  • I'll say this: a lot of discussion seems to take place in Matrix chat that never makes it into GitHub code comments explaining why specific changes are made.

    It used to be that you could see the actual content of deleted comments, and it was at the discretion of the client whether to show them. The newcomers from Reddit (June) seemed not to like that people could read the content of deleted comments, so I think changes were made for that reason.

    With federation, it really isn't reasonable to expect content copies to all be deleted. So it's a complex issue.

  • It can't be as simple as a date range, because we want to be inclusive of smaller communities.

    1. Paging is a consideration: 1000 posts per community would allow 50 pages of 20 posts.
    2. Small communities are defined as those with 1000 or fewer posts, regardless of age.
    3. Large communities would focus on recency: the 1000 posts would be the most recently created or edited.
    4. Edited is trickier; either skip it for now or work out how to keep some kind of mass edit from crowding out newly published posts (see the sketch after this list).
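
    A rough sketch of points 3 and 4, assuming some column holds the last-edit timestamp (called updated here purely for illustration; the real column name may differ):

    -- Rank on the newer of published/updated so an edit can keep a post in the
    -- per-community window, with id as the tie-breaker for a stable order.
    SELECT id, community_id, published
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY GREATEST(published, COALESCE(updated, published)) DESC, id DESC
               ) AS recency_rank
         FROM post_aggregates) ranked_recency
    WHERE recency_rank <= 1000;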

     

    Also a good time to be reminded that the published date isn't reliable for a couple reasons:

    1. Problems in the field have been shown with incoming federation data having future published dates on content. kbin is an easy example, but it isn't limited to kbin.
    2. Federation can lag due to server overload, problems with paths between specific servers, ISP issues, etc. It is rather common to receive a post hours after its published date. Lemmy currently does not track the 'received' date of content.
  • With 0.18.2 and 0.18.3 there were changes in the behavior of comment sorting and of delete / remove. It is entirely possible that the behavior changed, intentionally or otherwise.

    There is a !test@lemmy.ml community where you could create comments and take some screenshots before and after deleting.

  • lemmy-ui is still pretty bad about presenting spinning graphics when it encounters an error. As for why the title isn't rejected, maybe it's too short; I don't know the minimum length.

  • An Instance is just another word for 'server' in lemmy terminology. HDTV is a classic form of media that doesn't involve TCP/IP to watch films and other video content.

  • i'm curious about alternate front-end / API clients....

  • If everyone was spread out onto different instances

    Each instance has an owner/operator making rules... and the average social media user walks in, orders a drink, and starts smoking without any concern for whether either is allowed. People can be loyal to their media outlets even when it is beyond obvious they are bad. People are raised on storybooks that endorse bad behaviors and values, on HDTV networks, and on social media too. The audience's desire to "react comment" to images without actually reading what others have commented, or learning about the venue operators and the reasons for the rules, is pretty much the baseline experience in 2023.

  • When it comes to media attraction, what they call themselves (labels) doesn't really matter that much. It's the praise of strong men and authority that crosses all mythological media systems, be it bowing down to a burning-bush story, Fox News, or the Kremlin.

  • Keep in mind that you’re going to be retrieving and storing a huge amount of data running these scripts

    And you are adding to the overload of lemmy.world, beehaw, lemmy.ml, etc., which have all the popular content communities. Federation has a lot of overhead, as does having to distribute a community's activity one vote at a time to 500 subscribed servers.

  • Lemmy @lemmy.ml: FYI: even Lemmy servers upgraded to 0.18.0 are having problems replicating comments to each other. Missing comments on posts examples

  • Juke Box - share music @sh.itjust.works: Long Distance Runaround

  • Lemmy Server Performance @lemmy.ml: lemmy_server API for Clients and Federation alike, concurrency self-awareness, load sheding, and self-tuning

  • Juke Box - share music @sh.itjust.works: Fly Like an Eagle

  • Lemmy Server Performance @lemmy.ml: lemmy-ui seems to be doing database search for Post Title matches while typing every single character or edit, for performance reasons suggest an option to disable this feature be available to admins

  • lemmy.ml meta @lemmy.ml: At the time of this posting, lemmy.ml is crashing internally and frequently reporting JSON parsing problems of "timeout" messages - past 15 minutes

  • Lemmy Wish list @lemmy.ml: Lemmy 0.18.0 does not work when Firefox or Chrome saves a web page when viewing a Post ("Page Not Found") · Issue #1587 · LemmyNet/lemmy-ui

  • Juke Box - share music @sh.itjust.works: Leonard Cohen - Nevermind (Audio)

  • Lemmy Server Performance @lemmy.ml: Add http cache for webfingers by cetra3 · Pull Request #3317 · LemmyNet/lemmy (adding moka as an in-memory cache to Rust code)

  • Juke Box - share music @sh.itjust.works: Billy Joel - You May Be Right (Official Audio)

  • Lemmy Wish list @lemmy.ml: Sorting by hot includes very old posts (Allow instance operators to trigger immediate rebuild of sorting, plus tune parameters on what is 'hot' and 'active')

  • Lemmy Wish list @lemmy.ml: Procedures & code for ReHoming a Lemmy Community before and after an Instance is "lost"

  • Lemmy Administration @lemmy.ml: Lemmy as a project has suffered all month because Lemmy.ml has not been sharing critical logs from Nginx and Lemmy's code logging itself

  • Lemmy Administration @lemmy.ml: Queries for instance admins, for detecting bots and spam users. - Lemmy

  • Lemmy Wish list @lemmy.ml: lemmy server database UPDATE of each start of service and an API call to return uptime/start time of lemmy_server code. metainfo table or something

  • Lemmy Server Performance @lemmy.ml: Lemmy Server - full text real-time searching of all comments&posts is a performance concern. Might have worked when lemmy database was low on content.

  • Lemmy Server Performance @lemmy.ml: Lemmy's Rust code - interfacing to PostgreSQL, http methods for federation connections inbound and outbound

  • Lemmy Wish list @lemmy.ml: Ability for site admin to list words that users/communities can't register with to prevent impersonation.

  • World News @beehaw.org: Frozen fruit sold at Walmart, Aldi, Trader Joe’s, Target, Whole Foods recalled over possible listeria contamination | CNN

  • Lemmy Wish list @lemmy.ml: Even with lemmy.ml restarting today now on version 0.18, the default "active" view of posts seems neither active or fresh