• 0 Posts
  • 15 Comments
Joined 1 year ago
Cake day: June 27th, 2023

  • Y’all should really stop expecting people to buy into the analogy between human learning and machine learning, i.e., “humans do it, so it’s okay if a computer does it too”. First of all, there are vast differences between how humans learn and how machines “learn”, and second, it doesn’t matter anyway, because there is ample legal and moral precedent for not assigning machines the same rights that are normally assigned to humans (for example, as far as I’m aware, no intellectual property rights have yet been granted to any synthetic media).

    That said, I agree that “the model contains a copy of the training data” is not a very good critique; a much stronger one would be to simply note all of the works with a Creative Commons “No Derivatives” license in the training data, since it is hard to argue that the model checkpoint isn’t derived from the training data.


  • Yeah, I’ve struggled with that myself, since my first AI detection model was technically trained on potentially non-free data scraped from Reddit image links. The more recent fine-tune of that used only Wikimedia and SDXL outputs, but because it was seeded with the earlier base model, I ultimately decided to apply a non-commercial CC license to the checkpoint. But here’s an important distinction: that model, like many of the use cases you mention, is non-generative; you can’t coerce it into reproducing any of the original training material, because it’s just a classification tool (see the sketch below). I personally rate those models as much fairer uses of copyrighted material, though perhaps no better in terms of harm from a data dignity or bias propagation standpoint.
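    To make the non-generative distinction concrete, here’s a minimal sketch of how a detector like that gets used (the model name and labels here are hypothetical placeholders, not my actual checkpoint):

    ```python
    # A classifier maps an image to label scores and nothing else; there is no
    # decoding loop, hence no mechanism for it to emit training images back out.
    from transformers import pipeline

    detector = pipeline("image-classification", model="example/ai-image-detector")
    print(detector("suspect_image.png"))
    # e.g. [{'label': 'artificial', 'score': 0.97}, {'label': 'human', 'score': 0.03}]
    ```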



  • I’m getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this “there isn’t enough free data” claim has never been tested. The experiments that have come closest (look up the early Phi and StarCoder papers, or the CommonCanvas text-to-image model) suggest that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci). But yes, a social network or other organization with access to a bunch of data that it owns or has licensed could almost certainly fine-tune a base LLM trained solely on permissively licensed data (see the sketch below) and get a tremendously useful tool, one that would probably be safer and more helpful than ChatGPT for that organization’s specific business, at vastly lower risk of copyright claims or toxic generated content.
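    For illustration, a fine-tune like that might look something like the following minimal LoRA sketch (the base model, data file, and hyperparameters are placeholder assumptions on my part, not a tested recipe):

    ```python
    # Hedged sketch: adapt a permissively-licensed base model to an organization's
    # own data, with no additional scraping. All names below are placeholders.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "bigcode/starcoderbase"  # trained on permissively licensed code
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token   # causal-LM tokenizers often lack a pad token
    model = get_peft_model(AutoModelForCausalLM.from_pretrained(base),
                           LoraConfig(r=8, target_modules=["c_attn"],
                                      task_type="CAUSAL_LM"))

    # org_docs.jsonl: {"text": ...} records the organization owns or has licensed
    data = load_dataset("json", data_files="org_docs.jsonl")["train"]
    data = data.map(lambda row: tok(row["text"], truncation=True),
                    batched=True, remove_columns=data.column_names)

    Trainer(model=model,
            args=TrainingArguments(output_dir="org-tune",
                                   per_device_train_batch_size=1),
            train_dataset=data,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
    ```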


  • The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. OpenAI has suppressed this in a rather brute-force way, by prohibiting the prompts that have been found so far to do it (e.g. the infamous “poetry poetry poetry…” ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact, some people much smarter than me see technical similarities between compression technology and the process of training an LLM; Ted Chiang famously called ChatGPT a “blurry JPEG of the Web”. The point being, you wouldn’t allow distribution of a copyrighted book just because you compressed it in a ZIP file first.
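    The ZIP analogy is easy to make concrete. A toy example, with a public-domain opening line standing in for any copyrighted text:

    ```python
    # Round-tripping text through a compressor reproduces it verbatim, which is
    # why "but it was compressed first" is no defense for distributing a copy.
    import zlib

    text = b"Call me Ishmael."              # stand-in for a copyrighted passage
    packed = zlib.compress(text)            # smaller and unreadable...
    assert zlib.decompress(packed) == text  # ...but exactly recoverable
    ```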




  • I’m not proposing anything new, and I’m not here to “pitch” anything to you. Read Jaron Lanier’s writings, e.g. “Who Owns the Future?”, or watch a talk or interview he’s given, if you’re interested in a sales pitch for why data dignity is a problem worth addressing. I admire him greatly and agree with many of his observations, but am not sure about his proposed solution (mainly a system of micro-payments to creators of the data used by tech companies). I’m just here to point out that copyright infringement isn’t, in fact, the main or only thing bothering so many people about generative AI, so settling copyright disputes isn’t going to stop all those people from being upset about it.

    As to your comments about “feelings”, I would turn that around and ask: why should society prioritize the feelings (mainly greed) of the few tech executives and engineers who expect to profit from these practices over those of the many, many people who object to them?



  • What irks me most about this claim from OpenAI and others in the AI industry is that it’s not based on any real evidence. Nobody has tested the counterfactual approach he claims wouldn’t work, yet the experiments that came closest (the first StarCoder LLM and the CommonCanvas text-to-image model) suggest that it would, in fact, have been possible to produce something very nearly as useful, and in some ways better, with a more restrained approach to training-data curation than scraping outbound Reddit links.

    All that aside, copyright clearly isn’t the right framework for understanding why what OpenAI does bothers people so much. It’s really about “data dignity”, a relatively new moral principle not yet protected by any single law. Most people feel that they should have control over what data is gathered about their activities online, as well as what is done with that data after it’s been collected, and even if they publish or post something under a Creative Commons license that permits derived uses of their work, they’ll still get upset if it’s used as an input to machine learning. This is true even if the resulting generative models are created not for commercial reasons but only for personal or educational purposes that clearly constitute fair use. I’m not saying that OpenAI’s use of copyrighted work is fair; I’m just saying that even in cases where the use is clearly fair, there’s still a perceived moral injury, so I don’t think it’s wise to lean too heavily on copyright law if we want to find a path forward that feels just.



  • Capitalism is precisely the problem, because if the end product were never sold nor used in any commercial capacity, the case for “fair use” would be almost impossible to challenge. They’re betting on judges siding with them in extending a very specific interpretation of fair use, one that has been successfully applied to digital copying of content for archival and distribution (as in, e.g., Google Books or the Internet Archive) but which is itself not air-tight, just precedent.

    Even fair uses of media may not respect the dignity of the creators of the works used to build “media synthesizers”. In other words, even if a computer science grad student does a bunch of scraping for their machine learning dissertation, unless they ask for and get permission from the creators, their research isn’t upholding the principle of data dignity. Current law doesn’t address that principle at all, but it is obviously the real issue upsetting people about “generative AI”.


  • <greentext>

    Be me

    Early adopter of LLMs ever since a random tryout of Replika blew my mind and I set out to figure out what the hell was generating its responses

    Learn to fine-tune GPT-2 models and have a blast running 30+ subreddit parody bots on r/SubSimGPT2Interactive, including some that generate weird surreal imagery from post titles using VQGAN+CLIP

    Have nagging concerns about the industry that produced these toys, start following Timnit Gebru

    Begin to sense that something is going wrong when DALLE-2 comes out, clearly targeted at eliminating creative jobs in the bland corporate illustration market. Later, become more disturbed by Stable Diffusion making this, and much worse, possible at massive scale

    Try to do something about it by developing one of the first “AI Art” detection tools, intended for use by moderators of subreddits where such content is unwelcome. Get all of my accounts banned from Reddit immediately thereafter

    Am dismayed by the viral release of ChatGPT, essentially the same thing as DALLE-2 but for text

    Grudgingly attempt to see what the fuss is about and install GitHub Copilot in VSCode. Waste hours of my time debugging code suggestions that turn out to be wrong in subtle, hard-to-spot ways. Switch to using Bing Copilot for “how-to” questions because at least it cites sources and lets me click through to the StackExchange post where a human provided the explanation I need. Admit the thing can be moderately useful and not just a fun dadaist shitposting machine. Have major FOMO about never capitalizing on my early-adopter status in any money-making way

    Get pissed off by Microsoft’s plans to shove Copilot into every nook and cranny of Windows and Office; casually turn on the Olympics and get bombarded by ads for Gemini and whatever the fuck it is Meta is selling

    Start looking for an alternative to Edge, despite it being the best-performing web browser by many metrics, and despite my history with “AI” and OK-ish experience with Copilot. Horrified to find that Mozilla and Brave are doing the exact same thing

    Install Vivaldi, then realize that the Internet it provides access to is dead and enshittified anyway

    Daydream about never touching a computer again despite my livelihood depending on it

    </greentext>


  • Tangential, but I absolutely loved working in technical support. The satisfaction of actually helping someone with a problem affecting their real life totally outweighed the abuse from individuals who were letting the work part of their life drag the whole rest of it down (which was just kind of sad to watch). I’ve gotten paid much more for other roles since then, but it’s one of the few roles in which I was thanked for what I did by the person I was working for, and that makes a huge difference.


  • Running such a bot with an intentionally underpowered language model that has been trained to mimic a specific Reddit subculture is good clean absurdist comedy, if done up front and in the open on a sub that allows it, such as r/subsimgpt2interactive, the version of r/subsimulatorgpt2 that is open to user participation (a rough sketch of the recipe below).
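    The core loop of that kind of bot, assuming a small GPT-2 checkpoint fine-tuned on one subreddit’s posts (the model name here is a hypothetical placeholder):

    ```python
    # The "intentionally underpowered" part is the point: a small GPT-2 model
    # sampled at high temperature produces recognizably off-kilter parody.
    from transformers import pipeline

    bot = pipeline("text-generation", model="example/gpt2-subreddit-parody")
    reply = bot("What does everyone think about", max_new_tokens=60,
                do_sample=True, temperature=1.1)[0]["generated_text"]
    print(reply)
    ```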

    But yeah, fuck those ChatGPT bots. I recently posted on r/AITAH and the only response I got was obviously from a large language model… it was infuriating.