Posts 127 · Comments 598 · Joined 2 yr. ago

  • A core design goal of either approach is that server operators can adjust how this data is built without needing to modify or restart the lemmy_server Rust code.

    Using a smallint also gives some flexibility (or a new field, if going with the id min/max approach): if the requested page is greater than 10 for a particular sort, switch to include > 1 and fall into tiers.

  • An even less-intrusive approach is to not add any new field to existing tables. Instead, establish a reference table, say include_range. There is already an ENUM value for each sort type, so the include_range table would have these columns: sort_type ENUM, lowest_id BigInt, highest_id BigInt.
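
    A minimal sketch of what that reference table could look like (the enum type name sort_type_enum is an assumption here, not necessarily what the Lemmy schema calls it):

    CREATE TABLE include_range (
        sort_type  sort_type_enum NOT NULL,  -- one row per existing sort type (assumed enum name)
        lowest_id  bigint NOT NULL,          -- smallest post_aggregates.id to include for that sort
        highest_id bigint NOT NULL,          -- largest post_aggregates.id to include for that sort
        PRIMARY KEY (sort_type)
    );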

    Run a variation of this query to populate that table:

    SELECT MIN(ranked_recency.id) AS lowest_id,
           MAX(ranked_recency.id) AS highest_id
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;

    Run that against every sort order, including Old. Capture only two BigInt results: the MIN(id) and the MAX(id); that gives a range over the whole table. Then every SELECT on the post_aggregates / post table includes a WHERE id >= lowest_id AND id <= highest_id.

    That would put in a basic sanity check that ages out content, and it would filter right against the primary key!
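
    As an illustration of that WHERE clause (a sketch only; the sort value 'New' and the join shape are placeholders, not the existing ORM output):

    -- Constrain the scan to the pre-computed id range for the requested sort,
    -- then apply the normal filters, ORDER BY, and LIMIT on a much smaller set.
    SELECT pa.*
    FROM post_aggregates pa
    JOIN include_range ir ON ir.sort_type = 'New'
    WHERE pa.id >= ir.lowest_id
      AND pa.id <= ir.highest_id
    ORDER BY pa.published DESC
    LIMIT 20;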

  • Good results with this approach. I hadn't considered RANK() OVER (PARTITION BY criteria_a) and it works like a champ. It moves the ORDER BY onto the column of focus (criteria_a), performance seems decent enough... and the short statement isn't difficult to read.

     
        
    SELECT COUNT(ranked_recency.*) AS post_row_count
    FROM
      (
         SELECT id, post_id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;
    
      

    Gives me the expected results over the 5+ million test rows I ran it against.

    If you could elaborate on your idea of TOP, please do. I'm hoping there might be a way to push the LIMIT 1000 down into the inner query so the outer query doesn't have to WHERE-filter on rank across so many results?
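
    For what it's worth, one shape that might push the limit inward (a sketch only, not benchmarked; it assumes a community table with primary key id and an index on post_aggregates (community_id, published DESC, id DESC)):

    -- Per-community top-N via LATERAL: the LIMIT applies inside the subquery,
    -- so the outer query never sees more than 1000 rows per community.
    SELECT COUNT(recent.*) AS post_row_count
    FROM community c
    CROSS JOIN LATERAL (
       SELECT pa.id, pa.community_id, pa.published
       FROM post_aggregates pa
       WHERE pa.community_id = c.id
       ORDER BY pa.published DESC, pa.id DESC
       LIMIT 1000
    ) recent;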

  • What problem are you trying to solve?

    Reproducible, regular server crashes from queries taking tens of seconds, because the whole logic is based on WHERE clauses with no real selectivity. The server overloads in the field have been going on every single day that I've been here testing the big servers, since May 2023.

    If I want to look through 1000 posts in a community, I probably want to look at more than 1000.

    I'm well aware of the pushback. Everyone chimes in saying they want counting to be real time, and the developers seem to avoid caching at all costs. Out of desperation, I'm trying to build some kind of basic sanity logic into the system so it doesn't plow through 5 million rows to do a LIMIT 10 query.

    Right now Lemmy works perfectly fine with no personalization. For anonymous users it works great. If you want to read a million posts, it works great. Start blocking specific users, adding NSFW filters, cherry-picking a blend of communities, etc., and the problems show up. The ORM logic is difficult to follow, based on a massive JOIN of every field in many tables, and at certain data thresholds, with per-account preferences engaged, it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page = 1 of a LIMIT 20 query, even for a single community).

    The programmers who built the code for over 4 years don't seem to think it is an urgent problem. So I'm chipping in. I personally have never worked with this ORM and I find it painful compared to the hand-crafted SQL I've done on major projects. I'm doing this because I feel like nobody else has for months.

  • OK, experimenting on a massive test data set of over 5 million posts... this PostgreSQL query works pretty well:

     
        
    SELECT COUNT(ranked_recency.*) AS post_row_count
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;
    
      

    This limits any one community to 1000 posts, picking the most recently created ones. It gives a way to age out older data in very active communities without removing any posts at all from small communities.
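
    For a sense of how that window could plug into an actual listing rather than a count (sketch only; it assumes post_aggregates.post_id references post.id, as in the query above that selects post_id, and that post has a published column):

    -- Keep only the newest 1000 posts per community, then list a page of them.
    SELECT p.*
    FROM post AS p
    JOIN (
       SELECT post_id,
          rank() OVER (
             PARTITION BY community_id
             ORDER BY published DESC, id DESC
             ) AS recency_rank
       FROM post_aggregates
    ) ranked_recency ON ranked_recency.post_id = p.id
    WHERE ranked_recency.recency_rank <= 1000
    ORDER BY p.published DESC
    LIMIT 20;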

  • That is, give a “next page” token

    There's already a pull request on changing paging.

    My focus is a very hard wall on performance and scale. As things stand, there is way too much potential for queries to run into the full post table.

  • I'll say this: a lot of discussion seems to take place in Matrix chat that never makes it into GitHub code comments explaining why specific changes are made.

    It used to be that you could see the actual content of deleted comments, and it was at the discretion of the client whether to show them. The newcomers from Reddit (June) seemed not to like that people could read the content of deleted comments, so I think changes were made for that reason.

    With federation, it really isn't reasonable to expect content copies to all be deleted. So it's a complex issue.

  • It can't be as simple as a date range, because we want to be inclusive of smaller communities.

    1. Paging is a consideration: 1000 posts per community would allow 50 pages of 20 posts.
    2. Small communities are defined as those with 1000 or fewer posts, regardless of age.
    3. Large communities would focus on recency: the 1000 posts would be the most recently created or edited.
    4. Edited is trickier; either skip it for now or work out how to keep some kind of mass edit from crowding out newly published posts (see the sketch after this list).
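
    A rough sketch of points 3 and 4, assuming some column holds the last-edit timestamp (called updated here purely for illustration; the real column name may differ):

    -- Rank on the newer of published/updated so an edit can keep a post in the
    -- per-community window, with id as the tie-breaker for a stable order.
    SELECT id, community_id, published
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY GREATEST(published, COALESCE(updated, published)) DESC, id DESC
               ) AS recency_rank
         FROM post_aggregates) ranked_recency
    WHERE recency_rank <= 1000;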

     

    Also a good time to be reminded that the published date isn't reliable for a couple reasons:

    1. Problems in the field have been shown with incoming federation data having future published dates on content. kbin is an easy example, but it isn't limited to kbin.
    2. Federation can lag due to server overload, problems with paths between specific servers, ISP issues, etc. It is rather common to receive a post hours after its published date. Lemmy currently does not track the 'received' date of content.
  • With 0.18.2 and 0.18.3 there were changes in the behavior of comment sorting and of delete / remove. It is entirely possible that the behavior changed, intentionally or otherwise.

    There is a !test@lemmy.ml community where you could create comments and take some screenshots before and after deleting.

  • lemmy-ui is still pretty bad about presenting spinning graphics when it encounters an error. As for why the title isn't rejected, maybe it's too short; I don't know the minimum length.

  • An Instance is just another word for 'server' in lemmy terminology. HDTV is a classic form of media that doesn't involve TCP/IP to watch films and other video content.

  • i'm curious about alternate front-end / API clients....

  • If everyone was spread out onto different instances

    Each instance has an owner/operator making rules... and the average social media user walks in, orders a drink, and starts smoking without any concern for whether either is allowed. People can be loyal to their media outlets even when it is beyond obvious they are bad. People are raised on storybooks that endorse bad behaviors and values, on HDTV networks, and on social media too. The audience's desire to "react comment" to images without actually reading what others have commented, or learning about the venue operators and the reasons for the rules, is pretty much the baseline experience in 2023.

  • When it comes to media attraction, what they call themselves (labels) doesn't really matter that much. It's the praise of strong men and authority that crosses all mythological media systems, be it bowing down to a burning-bush story, Fox News, or the Kremlin.

  • Keep in mind that you’re going to be retrieving and storing a huge amount of data running these scripts

    And you are adding to the overload of lemmy.world, beehaw, lemmy.ml, etc., which have all the popular content communities. Federation has a lot of overhead, as does having to distribute a community's activity one vote at a time to 500 subscribed servers.

  • Lemmy @lemmy.ml: FYI: even Lemmy servers upgraded to 0.18.0 are having problems replicating comments to each other. Missing comments on posts examples

  • Juke Box - share music @sh.itjust.works: Long Distance Runaround

  • Lemmy Server Performance @lemmy.ml: lemmy_server API for Clients and Federation alike, concurrency self-awareness, load sheding, and self-tuning

  • Juke Box - share music @sh.itjust.works: Fly Like an Eagle

  • Lemmy Server Performance @lemmy.ml: lemmy-ui seems to be doing database search for Post Title matches while typing every single character or edit, for performance reasons suggest an option to disable this feature be available to admins

  • lemmy.ml meta @lemmy.ml: At the time of this posting, lemmy.ml is crashing internally and frequently reporting JSON parsing problems of "timeout" messages - past 15 minutes

  • Lemmy Wish list @lemmy.ml: Lemmy 0.18.0 does not work when Firefox or Chrome saves a web page when viewing a Post ("Page Not Found") · Issue #1587 · LemmyNet/lemmy-ui

  • Juke Box - share music @sh.itjust.works: Leonard Cohen - Nevermind (Audio)

  • Lemmy Server Performance @lemmy.ml: Add http cache for webfingers by cetra3 · Pull Request #3317 · LemmyNet/lemmy (adding moka as an in-memory cache to Rust code)

  • Juke Box - share music @sh.itjust.works: Billy Joel - You May Be Right (Official Audio)

  • Lemmy Wish list @lemmy.ml: Sorting by hot includes very old posts (Allow instance operators to trigger immediate rebuild of sorting, plus tune parameters on what is 'hot' and 'active')

  • Lemmy Wish list @lemmy.ml: Procedures & code for ReHoming a Lemmy Community before and after an Instance is "lost"

  • Lemmy Administration @lemmy.ml: Lemmy as a project has suffered all month because Lemmy.ml has not been sharing critical logs from Nginx and Lemmy's code logging itself

  • Lemmy Administration @lemmy.ml: Queries for instance admins, for detecting bots and spam users. - Lemmy

  • Lemmy Wish list @lemmy.ml: lemmy server database UPDATE of each start of service and an API call to return uptime/start time of lemmy_server code. metainfo table or something

  • Lemmy Server Performance @lemmy.ml: Lemmy Server - full text real-time searching of all comments&posts is a performance concern. Might have worked when lemmy database was low on content.

  • Lemmy Server Performance @lemmy.ml: Lemmy's Rust code - interfacing to PostgreSQL, http methods for federation connections inbound and outbound

  • Lemmy Wish list @lemmy.ml: Ability for site admin to list words that users/communities can't register with to prevent impersonation.

  • World News @beehaw.org: Frozen fruit sold at Walmart, Aldi, Trader Joe’s, Target, Whole Foods recalled over possible listeria contamination | CNN

  • Lemmy Wish list @lemmy.ml: Even with lemmy.ml restarting today now on version 0.18, the default "active" view of posts seems neither active or fresh