How does this pic show that Elon Musk doesn't know SQL?
I'm a tech interested guy. I've touched SQL once or twice, but wasn't able to really make sense of it. That combined with not having a practical use leaves SQL as largely a black box in my mind (though I am somewhat familiar with technical concepts in databasing).
With that, I keep seeing [pic related] as proof that Elon Musk doesn't understand SQL.
Can someone give me a technical explanation for how one would come to that conclusion? I'd love if you could pass technical documentation for that.
There can be duplicate SSNs due to name changes of an individual; that's the easiest answer. In general, it's common to add a new record when a person's information changes, so you retain the old record(s) and thus have a history for that person (look up Slowly Changing Dimensions (SCD)). That's how the SSA is able to figure out if a person changed their gender: they look up the existing data under the same SSN and see whether the gender in the new application differs from the old record.
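To make the history idea concrete, here is a minimal sketch of a "Type 2" slowly changing dimension using Python's built-in sqlite3. The table name, column names, and SSN values are all made up for illustration; nothing here reflects the SSA's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person_history (
        row_id     INTEGER PRIMARY KEY,  -- surrogate key, one per version
        ssn        TEXT NOT NULL,        -- repeats across versions on purpose
        full_name  TEXT NOT NULL,
        valid_from TEXT NOT NULL,
        valid_to   TEXT                  -- NULL marks the current version
    )
""")
conn.executemany(
    "INSERT INTO person_history (ssn, full_name, valid_from, valid_to) "
    "VALUES (?, ?, ?, ?)",
    [
        ("000-00-0001", "Jane Doe", "1990-01-01", "2015-06-30"),
        ("000-00-0001", "Jane Smith", "2015-07-01", None),  # after a name change
    ],
)

# Two rows share the SSN, but only one of them is the current version.
rows_for_ssn = conn.execute(
    "SELECT COUNT(*) FROM person_history WHERE ssn = ?", ("000-00-0001",)
).fetchone()[0]
current_name = conn.execute(
    "SELECT full_name FROM person_history WHERE ssn = ? AND valid_to IS NULL",
    ("000-00-0001",),
).fetchone()[0]
print(rows_for_ssn, current_name)
```

In this toy table the "duplicate" SSN is the whole point: it is what ties the old name to the new one.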
Another accusation Elon made was that payments are going to people missing SSNs. The best explanation I have for that is that various state departments have their own on-premise databases, with their own structure and design, that do not necessarily mirror the federal master database. There are likely some databases where the SSN field is set up to accept strings only, since in real life the SSN on your card has dashes, and those dashes make the number into a string. If the SSN is stored as a string in a state database, then when it's brought over to the federal database (assuming the federal db uses a number field instead of text), there can be some data loss, resulting in a NULL.
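A rough sketch of how that kind of loss could happen, in plain Python. The `ssn_to_number` loader is hypothetical and only stands in for "copy a text column into a numeric column"; it is an assumption for illustration, not a description of any real state or federal pipeline.

```python
def ssn_to_number(raw):
    """Naive loader: return an int, or None when the text is not purely digits."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None  # this is what would end up as NULL in the target column


incoming = ["123-45-6789", "012345678", None]
loaded = [ssn_to_number(value) for value in incoming]
print(loaded)  # [None, 12345678, None]
```

The dashed value fails to convert and becomes a missing value. The second value converts but silently loses its leading zero, which is a separate reason identifiers like this are usually kept as text.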
It’s so basic that documentation is completely unnecessary.
“De-duping” could mean multiple things, depending on what you mean by “duplicate”.
It could mean that the entire row of some table is the same. But that has nothing to do with the kind of fraud he’s talking about. Two people with the same SSN but different names wouldn’t be duplicates by that definition, so “de-duping” wouldn’t remove it.
It can also mean that a certain value shows up more than once (eg just the SSN). But that’s something you often want in database systems. A transaction log of SSN contributions would likely have that SSN repeated hundreds of times. It has nothing to do with fraud, it’s just how you record that the same account has multiple contributions.
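Here is a small sketch of the difference between those two meanings of "duplicate", using Python's sqlite3. The `contributions` table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contributions (ssn TEXT, paid_on TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?, ?)",
    [
        ("000-00-0001", "2024-01-15", 12000),
        ("000-00-0001", "2024-02-15", 12000),
        ("000-00-0001", "2024-03-15", 12500),
        ("000-00-0002", "2024-01-15", 9000),
    ],
)

total_rows = conn.execute("SELECT COUNT(*) FROM contributions").fetchone()[0]
# Meaning 1: whole-row duplicates. DISTINCT over every column removes none here.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT ssn, paid_on, amount_cents FROM contributions)"
).fetchone()[0]
# Meaning 2: a repeated value in one column. The SSN repeats, as it should.
distinct_ssns = conn.execute(
    "SELECT COUNT(DISTINCT ssn) FROM contributions"
).fetchone()[0]
print(total_rows, distinct_rows, distinct_ssns)  # 4 4 2
```

No row is a full duplicate, yet one SSN appears three times. Collapsing the table down to one row per SSN would throw away two real contributions.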
A database system as large as the SSA has needs to deal with all kinds of variations in data (misspellings, abbreviations, moves, siblings, common names, etc). Something as simplistic as “no dupes anywhere” would break immediately.
It's an insanely idiotic thing to say. Federal government IT is myriad, and done at a per-agency level. Any relational database system, and the federal government uses plenty of them, uses SQL in one way or another. Elon doesn't know what he is talking about at all, and is being an ultimate idiot about this. Even in the context of mainframe projects, which is what he might be referring to if we are giving Elon the benefit of the doubt, most COBOL shops I know have adapted to addressing internal data records through an SQL interface, although obviously in that legacy world it is insanely fractured and arcane.
Having never seen the database schema myself, my read is that the SSN is used as a primary key in one table, and many other tables likely use that as a foreign key. He probably doesn't understand that foreign keys are used as links and should not be de-duplicated, as that breaks the key relationship in a relational database.
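A minimal sketch of that primary key / foreign key arrangement, again with Python's sqlite3. Using the SSN as the key is the assumption made in the comment above, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.execute("CREATE TABLE people (ssn TEXT PRIMARY KEY, full_name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE payments (
        payment_id   INTEGER PRIMARY KEY,
        ssn          TEXT NOT NULL REFERENCES people(ssn),  -- foreign key
        amount_cents INTEGER NOT NULL
    )
""")
conn.execute("INSERT INTO people VALUES ('000-00-0001', 'Jane Doe')")
conn.executemany(
    "INSERT INTO payments (ssn, amount_cents) VALUES (?, ?)",
    [("000-00-0001", 150000), ("000-00-0001", 150000), ("000-00-0001", 152000)],
)

# The primary key already refuses a second person row with the same SSN...
try:
    conn.execute("INSERT INTO people VALUES ('000-00-0001', 'Someone Else')")
    pk_rejected_duplicate = False
except sqlite3.IntegrityError:
    pk_rejected_duplicate = True

# ...while the same SSN legitimately repeats in the table that links to it.
payments_per_person = conn.execute("""
    SELECT p.full_name, COUNT(*)
    FROM people p JOIN payments pay ON pay.ssn = p.ssn
    GROUP BY p.ssn, p.full_name
""").fetchall()
print(pk_rejected_duplicate, payments_per_person)  # True [('Jane Doe', 3)]
```

In this sketch, "de-duplicating" the `ssn` column of `payments` would delete two of Jane's three payments without making anything more correct.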
As others have mentioned, even in the main table there are probably reused or updated SSNs that would then be multiple rows that have timestamps and/or Boolean flags for current/expired.
TL;DR: de-duplication in that form usually refers to a technique where two different references in the file system point at one single piece of data on the drive, the intention being to optimize file storage size and minimize fragmentation.
You can imagine this would be very useful when taking backups, for instance. We call this a "Copy on Write" approach, since generally it works by copying the existing file to a second reference point, where you can then add an edit on top of the original file, while retaining 100% of the original file size and both copies of the file. (It's more complicated than this, obviously, but you get the idea.)
Now, just to be clear, if you did implement this in a DB, which you could do fairly trivially, it would change nothing about how the DB operates. It wouldn't remove "duplicates"; it would only coalesce duplicate data into one single tree to optimize disk usage. I have no clue what Elon thinks it does.
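For the storage sense of the word, here is a toy content-addressed store in Python. `DedupStore` is a made-up class for illustration and leaves out copy-on-write, block splitting, and everything else a real file system does.

```python
import hashlib


class DedupStore:
    """Toy store: identical contents are kept once and referenced many times."""

    def __init__(self):
        self.blobs = {}  # content digest -> bytes, stored once
        self.names = {}  # file name -> content digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # only stored if not already present
        self.names[name] = digest

    def get(self, name):
        return self.blobs[self.names[name]]


store = DedupStore()
store.put("report.txt", b"same bytes")
store.put("backup/report.txt", b"same bytes")
store.put("notes.txt", b"different bytes")
print(len(store.names), len(store.blobs))  # 3 2
```

All three names still resolve to their contents. Only the physical storage shrank, so nothing a user can see was "removed".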
The problem here, as a non-programmer, is that I don't understand why you would ever de-duplicate a database. Maybe there's a reason to do it, but I genuinely cannot think of a single instance where you would want to delete one entry and replace it with a reference to another, or do what Elon is implying here (remove "duplicate" entries, however that's supposed to work).
Elon doesn't know what "de-duplication" is, and I don't know why you would ever want that in a DB. It seems like a really good way to explode everything.
To me I'm not really sure what his reply even means. I think it's some attempt at a joke (because of course the government uses SQL), but I figure the joke can be broken down into two potential jokes that fail for different, embarrassing reasons:
Interpretation 1: The government is so advanced it doesn't use SQL - This interpretation is unlikely given that Elon is trying to portray the government as in need of reform. But it would make more sense if coming from a NoSQL type who thinks SQL needs to be removed from everywhere. NoSQL Guy is someone many software devs are familiar with who takes the sometimes-good idea of avoiding SQL and takes it way too far. Elon being NoSQL Guy would be dumb, but not as dumb as the more likely interpretation #2.
Interpretation 2: The government is so backward it doesn't use SQL - I think this is the more likely interpretation as it would be consistent with Elon's ideology, but it really falls flat because SQL is far from being cutting-edge. There has kind of been a trend of moving away from SQL (with considerable controversy) over the last 10 years or so and it's really surprising that Elon seems completely unaware of that.
The statement "this [guy] thinks the government uses SQL" demonstrates a complete and total lack of knowledge as to what SQL even is. Every government on the planet makes extensive and well documented use of it.
The initial statement, I believe, comes down to a combination of the above and a lack of domain knowledge around social security. The primary key on the social security table would be a composite key of both the SSN and a date of birth, so duplicates of just one part of the key are expected.
If he knew the domain, he would know this isn't an issue. If he knew the technology he would be able to see the constraint and following investigation, reach the conclusion that it's not an issue.
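A sketch of what a composite key like that would look like, using Python's sqlite3. Whether the real table is keyed this way is the commenter's assumption, and the `holders` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE holders (
        ssn           TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        full_name     TEXT NOT NULL,
        PRIMARY KEY (ssn, date_of_birth)  -- uniqueness applies to the pair
    )
""")
conn.execute("INSERT INTO holders VALUES ('000-00-0001', '1950-03-02', 'Jane Doe')")
# Same SSN with a different date of birth: the constraint allows it.
conn.execute("INSERT INTO holders VALUES ('000-00-0001', '1982-11-20', 'John Roe')")

# Repeating the whole key is what gets rejected.
try:
    conn.execute(
        "INSERT INTO holders VALUES ('000-00-0001', '1950-03-02', 'Copy Of Jane')"
    )
    full_key_rejected = False
except sqlite3.IntegrityError:
    full_key_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM holders").fetchone()[0]
print(full_key_rejected, row_count)  # True 2
```

Reading the constraint is enough to see that a repeated SSN on its own is not a violation under this design.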
"The government" is multiple agencies and departments. There is no single computer system, database, mainframe, or file store that the entire US government uses. There is no standard programming language used. There is no standard server configuration. Each agency is different. Each software project is different.
When someone says the government doesn't use SQL, they don't know what they are talking about. It could be referring to the fact that many government systems are ancient mainframe applications that store everything in VSAM. But it is patently false that the government doesn't use SQL. I've been on a number of government contracts over the years, spanning multiple agencies. MS SQL was used in all but one.
Furthermore, some people share SSNs; they are not unique. It's a common misconception that they are, but anyone working on government software learns this pretty quickly. The fact that it seems to be a big shock goes to show that he doesn't know what he is doing and neither do the people reporting to him.
Not only is he failing to understand the technology, he is failing to understand the underlying data he is looking at.
As a data engineer for the past 20+ years:
There is absolutely no fucking way that the US gov doesn't use SQL. This is what shows that he's stupid not only in SQL but in data science in general.
Regarding duplications: it's more nuanced than the statements each side put out. There can be duplications in certain situations. In some situations there shouldn't be. And I don't really see how duplications in a DB open it up to fraud.
Because of course the government uses SQL. It's as stupid as saying the government doesn't use electricity or something equally stupid. The government is myriad agencies running myriad programs on myriad hardware with myriad people. My damned computers at home are using at least 2-3 SQL databases for some of the programs I run.
SQL is damn near everywhere where data sets are found.
Musk's statement about the government not using SQL is false. I worked for FEMA for fourteen years, a decade of which was as a Reports Analyst. I wrote Oracle SQL+ code to pull data from a database and put it into spreadsheets. I know, I know. You're shocked that Elon Musk is wrong. Please remain calm.
To oversimplify, there are two basic kinds of databases: SQL (Structured Query Language, usually pronounced like "sequel" or spelled aloud) and NoSQL ("Not Only SQL").
SQL databases work as you'd imagine, with tables of rows and columns like a spreadsheet that are structured according to a fixed schema.
NoSQL includes all other forms of databases, document-based, graph-based, key-value pairs, etc.
The former are highly consistent and efficient at processing complicated queries or recording transactions, while the latter are more flexible and can be very fast at reads/writes but are harder to keep in sync as a result.
All large orgs will have both types in use for different purposes; SQL is better for banking needs where provable consistency is paramount, NoSQL better for real-time web apps and big data processing that need minimal response times and scalable capacity.
That Musk would claim the government doesn't use SQL immediately betrays him as someone who is entirely unfamiliar with database administration, because SQL is everywhere.
How come Republicans keep saying that doggy is going to expose all the fraud in the government, yet the biggest fraud, with 37 felonies, is president? What the actual fuck do these people think?
I think a lot of comments here miss the mark, it's not really just about stating the gov does not use SQL or speculation regarding keys.
Deduplication is generally part of a compression strategy and has nothing to do with SQL. If we're being generous he may have been talking about normalization, but no one I have ever met has confused the two terms (they are distinctly different from an engineering perspective).
There are degrees of normalization too, so it may make total sense to normalize 3NF (third normal form) rather than say 6NF depending on the data.
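For contrast with storage de-duplication, here is a small normalization sketch with Python's sqlite3. The `flat_claims` table is invented. It repeats an office's city on every claim, and the split moves that fact into its own table, which is roughly the step toward third normal form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat table: office_city depends on office_code, not on the claim.
    CREATE TABLE flat_claims (
        claim_id INTEGER PRIMARY KEY,
        office_code TEXT, office_city TEXT, amount_cents INTEGER
    );
    INSERT INTO flat_claims VALUES
        (1, 'A1', 'Springfield', 5000),
        (2, 'A1', 'Springfield', 7000),
        (3, 'B2', 'Shelbyville', 6500);

    -- Normalized: each office fact is stored once and referenced by code.
    CREATE TABLE offices (office_code TEXT PRIMARY KEY, office_city TEXT NOT NULL);
    CREATE TABLE claims (
        claim_id INTEGER PRIMARY KEY,
        office_code TEXT NOT NULL REFERENCES offices(office_code),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO offices SELECT DISTINCT office_code, office_city FROM flat_claims;
    INSERT INTO claims SELECT claim_id, office_code, amount_cents FROM flat_claims;
""")

office_rows = conn.execute("SELECT COUNT(*) FROM offices").fetchone()[0]
claim_rows = conn.execute("SELECT COUNT(*) FROM claims").fetchone()[0]
rebuilt = conn.execute("""
    SELECT c.claim_id, c.office_code, o.office_city, c.amount_cents
    FROM claims c JOIN offices o ON o.office_code = c.office_code
    ORDER BY c.claim_id
""").fetchall()
original = conn.execute("SELECT * FROM flat_claims ORDER BY claim_id").fetchall()
print(office_rows, claim_rows, rebuilt == original)  # 2 3 True
```

Note that `office_code` still repeats in `claims` after normalizing. The repeated key is how the join gets the original rows back, so even normalization does not mean "no value appears twice".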
If he doesn't think the government uses SQL after having his goons break into multiple government servers, he is an idiot.
If he is lying to cover his ass for fucking up so many things (the more likely explanation), then saying "he never used SQL" is basically a dig at how technically inept he really is despite bragging about being a tech bro.
The US government pays lots of money to Oracle to use their database. And it's not for Berkeley DB either. (Poor Sleepycat.) Oracle provides them support for their relational databases... and those databases use... SQL.
Now if Musk tries to end the Oracle contracts, then Oracle's lawyers will go after his lawyers and I'm a gonna get me some popcorn. (But we all know that won't happen in any timeline... Elon gotta keep Larry happy.)
If SSNs are used as a primary key (a unique identifier for a row of data), then they'd have to be repeated in other tables to be able to join the data together.
However, even if they aren't using the SSN as an identifier because it's sensitive information, it's not uncommon to repeat data, whether for speed/performance, for simplicity in table design, because it's in a lookup table, or because you have disconnected tables.
Having a value repeated doesn't tell you anything about fraud risk, efficiency, or really anything. Using it as the primary piece of evidence for a claim isn't a strong argument.
I'm still learning SQL, so if I'm out of line someone please correct me, but, the gist of it, is that SQL (Structured Query Language) is a language used in pretty much all relational databases, which with something like the Social Security database is almost guaranteed. Having duplicates of information in a relational database is not a sign of fraud, or anything shady going on.
When you're born, your name, along with your SSN and any other relevant info, is put into the database. Later in life, say you change your name: the original name, along with your SSN, will stay there, and a new line will be added to the database with your new name, along with your SSN again (a duplicate). That way the database has a reference point between the old and new name, and keeps all your information lined up between the two.
If you were to get rid of all of that duplicate information for anyone who's ever had a name change, been married, etc., it would cause chaos in the database, with hundreds of millions of entries that now have no relation to anything and are basically dead ends.
I saw a comment about this in the last couple of days that was really interesting and educational. Unfortunately I can't seem to find it again to link it, but the gist of it was that there would be two things wrong with using SSNs as primary keys in a SQL database:
You should not use externally generated data as primary keys
You should not use personally identifying data as primary keys
Using SSNs as keys would violate both.
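A sketch of the alternative those two rules point toward: an internally generated (surrogate) key, with the SSN as an ordinary column. It uses Python's sqlite3, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE people (
        person_id INTEGER PRIMARY KEY,  -- generated internally, carries no meaning
        ssn       TEXT NOT NULL,        -- external, personal data: just a column
        full_name TEXT NOT NULL
    );
    CREATE TABLE benefits (
        benefit_id INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL REFERENCES people(person_id),
        kind       TEXT NOT NULL
    );
""")

# The SSN is entered with a typo at first.
cur = conn.execute(
    "INSERT INTO people (ssn, full_name) VALUES (?, ?)", ("000-00-0010", "Jane Doe")
)
person_id = cur.lastrowid
conn.execute(
    "INSERT INTO benefits (person_id, kind) VALUES (?, ?)", (person_id, "retirement")
)

# Correcting the SSN touches one row; no linked table has to change.
conn.execute(
    "UPDATE people SET ssn = ? WHERE person_id = ?", ("000-00-0001", person_id)
)
linked = conn.execute("""
    SELECT p.ssn, b.kind
    FROM benefits b JOIN people p ON p.person_id = b.person_id
""").fetchall()
print(linked)  # [('000-00-0001', 'retirement')]
```

Because other tables link on `person_id`, the sensitive value also never has to be copied into them.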
I went looking for best practices regarding SQL primary keys and found this really interesting post and discussion on Stack Overflow:
My first thought was that people's SSNs can and do change, and sometimes (rarely?) people may have more than one SSN. Like someone mentions in that link, human error would be another reason why you would not want to use external data and particularly SSNs as primary keys.
TIL Elon doesn't know SQL or have any basic human decency.
J/K, I already knew he doesn't have basic human decency.
If he knew anything about SQL, he could have run a quick search to see whether any SSNs are actually duplicated. (spoiler alert: they're not, he's just stupid).
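The kind of quick search meant here is a one-line GROUP BY. Below is a sketch against a made-up `people` table in Python's sqlite3; the data is invented and says nothing about what the real tables contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (ssn TEXT, full_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("000-00-0001", "Jane Doe"),
        ("000-00-0001", "Jane Smith"),
        ("000-00-0002", "John Roe"),
    ],
)

# List every SSN that appears on more than one row, with its row count.
repeated = conn.execute("""
    SELECT ssn, COUNT(*) AS n
    FROM people
    GROUP BY ssn
    HAVING COUNT(*) > 1
    ORDER BY ssn
""").fetchall()
print(repeated)  # [('000-00-0001', 2)]
```

A query like this only tells you that a value repeats. Deciding whether the repeat is a name change, a history row, or an actual problem still takes knowledge of the schema.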
I mean suggesting the government doesn't use SQL, in tech-speak is about as dumb as saying the government doesn't use numbers.
Government is full of what are known as relational databases, as you are well aware, and though it stands to reason that they aren't all using the same software to manage them, many can be accessed using a standard language of commands. It could be Microsoft Access, MySQL, MariaDB, Apache Derby, Microsoft SQL Server, PostgreSQL, SQLite, SAP HANA, and so on and so forth. That language is Structured Query Language (SQL).
And saying there can be multiple entries in a database for one item with respect to the Social Security Database is, to me, a silly distraction and spreading BS FUD to ignorant people. As others have already mentioned, most databases have an internal sequence (keyID) number that is unrelated to the personal ID number of the person whose data is collected.