<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Life Is Pain, Then You Die: The Blog (Posts about ai)</title><link>https://siweizhu.com/</link><description></description><atom:link href="https://siweizhu.com/categories/ai.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2020 &lt;a href="mailto:test"&gt;Siwei Zhu&lt;/a&gt; </copyright><lastBuildDate>Wed, 02 Sep 2020 05:50:00 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>AI Generated Hacker News Comments Project Breakdown</title><link>https://siweizhu.com/posts/ai-generated-hacker-news-comments/</link><dc:creator>Siwei Zhu</dc:creator><description>&lt;div&gt;&lt;p&gt;I've been scraping hacker news comments for another project, and while I had the data I figured I would use it to ship &lt;a class="reference external" href="https://siweizhu.com/projects/hnaas"&gt;a quick project&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The idea is simple: I pretty much only read the comments on HN, almost never the actual article (ain't nobody got time for that). By training AI to generate comments, you can skip reading the article for anything you want, not just the ones that get a lot of attention.&lt;/p&gt;
&lt;p&gt;Originally I envisioned having the model fetch the url and reading the actual article or a summary of it, but in the interest of shipping as close to inside of a weekend as possible the model is predicated on only the title and url. &lt;a class="reference external" href="https://github.com/salesforce/ctrl"&gt;Previous work (salesforce CTRL)&lt;/a&gt; has shown that you can generate reasonable news articles from the url alone, so it's not far-fetched to expect reasonable comments from the title/url alone.&lt;/p&gt;
&lt;div class="section" id="tech-stack"&gt;
&lt;h2&gt;Tech Stack&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Django + DRF for backend&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;React for frontend (I previously used Django + DRF + React, with Daphne/Channels added for websockets, for &lt;a class="reference external" href="https://turn-base.com"&gt;Turnbase&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/minimaxir/gpt-2-simple"&gt;gpt-2-simple&lt;/a&gt; (I have previously used this to generate, uh, interesting works of literature)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="pipeline"&gt;
&lt;h2&gt;Pipeline&lt;/h2&gt;
&lt;p&gt;Most people are naturally attracted to the modeling aspects of machine learning, but to deploy a production ML system you need to think about/work on the whole pipeline. For this project, I probably spent less than 10% of the total work time finetuning the model. Almost all of the work is in data processing and building the UI.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Data acquisition&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;This is straightforward as HN has a pretty clean API, but it did take a few days to scrape (I didn't count this as part of the project time, as nominally it was for another project)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
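The scraping loop amounts to fetching items one at a time and recursing through each item's children. A rough sketch, with the caveat that only the endpoint URL is the real HN Firebase API; the traversal takes the fetch function as a parameter so it can be exercised without the network:

```python
import json
from urllib.request import urlopen

# Official HN API: one HTTP call per item.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id):
    """Fetch a single story/comment as a dict."""
    with urlopen(ITEM_URL.format(item_id)) as resp:
        return json.loads(resp.read())

def crawl(item_id, fetch=fetch_item):
    """Recursively collect an item and all of its descendants.

    `fetch` is injected so the traversal logic can be tested offline
    with a fake fetcher.
    """
    item = fetch(item_id)
    items = [item]
    for kid in item.get("kids", []):
        items.extend(crawl(kid, fetch=fetch))
    return items
```

The real scraper also needs rate limiting, retries, and persistence, which are omitted here.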
&lt;ol class="arabic simple" start="2"&gt;
&lt;li&gt;&lt;p&gt;Data cleaning/processing&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;I originally kept each HN item as its own text file (because I didn't want to bother escaping newlines), but this turned out to make things too slow. As it turns out, it's not good to have a huge number (23 million) of very small files. So I then batched them into page files of a few hundred thousand items each.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, I wrote structures for tree traversal so that the entire comment chain can be displayed. A child of a particular comment might be stored in another page depending on how much time elapsed between them, so to build out the tree for a particular root you might need to seek ahead through multiple pages. You can't keep all of the data in memory (maybe you could; I couldn't because my machine wasn't beefy enough), only a few pages at a time, so I had to write a caching mechanism for this. This step took longer than expected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cleaning was not too bad as the data is mostly clean. I mostly just filtered out dead comments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It was not clear the best way to format the final training text files. How do you let the model know that a comment is a child to another? Do you write out all the children of a root item (a story) in a nested format (so that each story gets one sample in the training file), or do you write it out one reply chain at a time (so that each story generates multiple samples, which you then need to cap to make sure the most popular stories aren't overrepresented), etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the end I used &lt;code class="docutils literal"&gt;&amp;lt;|c|&amp;gt;&lt;/code&gt; and &lt;code class="docutils literal"&gt;&amp;lt;|ec|&amp;gt;&lt;/code&gt; tokens to start/end a children block, sorted the children (the HN API gives you the rank but not the score for children), and limited each item to 10 children (so in theory we keep only the best comments) with no limit on depth (so comment chains can go as long as they want). In theory, the model should also learn the distribution of how many replies an item is likely to get this way (with the cap of 10 slightly modifying the distribution).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The whole thing is dumped into a single file with &lt;code class="docutils literal"&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; to delimit stories.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="admonition note"&gt;
&lt;p class="admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Hilariously, I found a &lt;a class="reference external" href="https://github.com/openai/gpt-2/issues/222"&gt;bug&lt;/a&gt; with how &lt;code class="docutils literal"&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; is used that may explain some previous weirdnesses I'd seen with gpt-2-simple.&lt;/p&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
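The serialization step described above is a short recursion: cap children at 10, no depth limit, wrap each children block in the start/end tokens. A minimal sketch; I use bracketed stand-in tokens here purely for readability in place of the actual &lt;code class="docutils literal"&gt;&amp;lt;|c|&amp;gt;&lt;/code&gt;/&lt;code class="docutils literal"&gt;&amp;lt;|ec|&amp;gt;&lt;/code&gt; tokens, and the item schema is illustrative:

```python
# Stand-ins for the special tokens (the real run used the
# angle-bracket |c| / |ec| tokens described in the post).
C, EC = "[c]", "[ec]"
MAX_CHILDREN = 10  # keep only the top-ranked replies per item

def serialize(item):
    """Flatten one story (or comment) and its reply tree into text.

    Children are assumed already sorted by the rank the HN API
    returns; only the first MAX_CHILDREN are kept, with no limit
    on how deep a reply chain can go.
    """
    text = item.get("text", "")
    kids = item.get("children", [])[:MAX_CHILDREN]
    if kids:
        text += C + "".join(serialize(k) for k in kids) + EC
    return text
```

Each serialized story then becomes one sample in the training file, delimited by the end-of-text token.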
&lt;ol class="arabic simple" start="3"&gt;
&lt;li&gt;&lt;p&gt;Model training&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;I kept things basic and didn't spend much time tuning parameters. I think you need MLflow or similar if you want to do any real tuning, otherwise you end up with a bunch of models with names like &lt;code class="docutils literal"&gt;hn_model_lr_0001_final2_noclip&lt;/code&gt;, but I didn't attempt to set it up for this project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Used the 355M (medium) GPT-2 model, trained on a p3.2xlarge.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;ol class="arabic simple" start="4"&gt;
&lt;li&gt;&lt;p&gt;Model deployment + serving&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The simplest architecture would be to keep a copy of the model on the same machine that serves the website. Whenever someone submits a new story that they want AI-generated comments for, the backend would then invoke the model. You wouldn't want to invoke it right away, though; you'd need to put the request in a queue and run the model on a worker thread, or the website will freeze whenever the model is thinking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I was a cheapskate and the machine that serves the website is a t2.micro, so inference wasn't even possible as it OOMs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I investigated whether you could get by with a t2.large. You can do inference (no OOM), but on the CPU it's roughly 40-60x slower than a p3.2xlarge (on which inference takes about 30 seconds, or more like 50 seconds if you include the overhead of loading the model). I felt that would make the user experience too poor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;So instead, there are two machines (website and inference). The website's DB acts as a container for the queue. There is an endpoint to broadcast whether there are stories awaiting generation. A script continuously checks the endpoint; when there are stories, it SSHes into the second machine and kicks off a second script that lives on that machine to do inference (this way I didn't have to set up endpoints on the second machine, saving myself some work, although this SHOULD be fairly straightforward with an MLflow workflow). This also means that inference can be batched, which is a bit more efficient given the overhead of loading the model and setting things up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's really janky, but it works, and it's cheap. You could set it up so that the second machine is off most of the time and only spun up for inference, then shut down afterwards (adding slightly to the per-inference overhead). This works out pretty well since inference is on the order of a minute, which is the minimum time block AWS will bill you for. For now I just turn it on manually whenever there are things in the queue and I'm paying attention, which is almost never.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
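The polling half of this setup amounts to very little code. A hedged sketch, in which the SSH target, remote script name, and queue format are all made up for illustration (the real version wraps this in a loop and hands the command to subprocess.run):

```python
def poll_once(pending_ids, ssh_target="ubuntu@inference-box"):
    """Check the queue once; if stories are waiting, build the SSH
    command that kicks off the inference script on the second machine.

    `pending_ids` would come from the website's queue endpoint; the
    host and script path here are hypothetical.
    """
    if not pending_ids:
        return None
    # Batch every pending story into a single inference run to
    # amortize the model-loading overhead.
    args = " ".join(str(i) for i in pending_ids)
    return ["ssh", ssh_target, "python run_inference.py " + args]
```

Returning the command (rather than running it) keeps the decision logic trivially testable; the real script executes it and then POSTs the generated comments back to the website.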
&lt;/div&gt;
&lt;div class="section" id="tips-summary"&gt;
&lt;h2&gt;Tips/Summary&lt;/h2&gt;
&lt;p&gt;This post has gone on long enough, so I'll just quickly summarize with a few tips:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Think about your whole pipeline. You need tooling for the entire pipeline so that you can iterate (your data format may change, your encoding may change) quickly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For gpt2 specifically, encode your data before you finetune.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Think about inference costs. I was expecting that most of the cost would be in training the model, so it was an unwelcome surprise to discover that I needed a machine as powerful as the training machine to do inference. In fact, for typical production models, inference costs will far outweigh training costs (unless you're updating your model constantly), so it's much more important to make sure that inference is efficient and economical. My interest in the distill* family of techniques has gone up since this project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Funnily enough, GPT-3 came out the day after I shipped this project. I don't want to think about deploying that in a production system.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="addendum-2020-07-11"&gt;
&lt;span id="hnaas-addendum"&gt;&lt;/span&gt;&lt;h2&gt;Addendum (2020-07-11)&lt;/h2&gt;
&lt;p&gt;I tried deploying a GPT-2 model with MLflow. MLflow can build a docker image for you with the model, artifacts, and any necessary libraries packaged inside. I've used it to deploy simpler models before and prefer this workflow because I don't want to mess with setting up conda, and conceptually it's clean. However, I ran into some kind of CUDA error. This seems to be an issue with MLflow's docker build for TensorFlow models specifically.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>ai</category><category>gpt</category><category>ml</category><category>transformers</category><guid>https://siweizhu.com/posts/ai-generated-hacker-news-comments/</guid><pubDate>Wed, 03 Jun 2020 18:49:29 GMT</pubDate></item><item><title>Is Simulation Possible?</title><link>https://siweizhu.com/posts/is-simulation-possible/</link><dc:creator>Siwei Zhu</dc:creator><description>&lt;div&gt;&lt;p&gt;In “the future”, when I upload my brain into a computer (whatever that means) and delete the physical original, have I transcended into immortal existence or did I just kill myself?&lt;/p&gt;
&lt;p&gt;That question is too hard. Let’s start with an easier one: “is strong AI possible?” Here is an argument for why it should be possible:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We don’t understand how consciousness works, exactly, but however it works, it’s based on physical processes, because the brain is a physical object. (The notion of a soul existing in some astral plane separated from physical existence is absurd.) Even if we don’t have a good theory of how cognition works, we do have a good theory of how the universe works. Forget about cognition; it’s too hard. In two years, when we have desktops running at ten thousand teraflops or whatever, let’s just simulate the brain at the sub-atomic level, lepton for lepton, leprechaun for leprechaun. Then you have machine sentience.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Implicit in this argument is that simulation is possible. Obviously, simulation is possible in the sense that you can simulate things (subject to computing power, which let’s assume is not an issue, because historical trends are good and the question here is whether simulation is possible in principle); what I mean by this is whether the simulated brain is just as “real” as an actual physical brain. Or, to be even more ambitious, let’s simulate a whole universe, with planets and meadows and people running through trees hunting dinosaurs, etc.  Is this universe real? Are those people real?&lt;/p&gt;
&lt;p&gt;It’s kind of a stupid question, because if it talks like a real person and acts like a real person, then… Also, you could argue that this universe is real enough to the people inside of it–they can feel the grass between their toes and be warmed by the sun and smell spring and all of that mushy stuff. They would have no idea that their existence is a simulation.&lt;/p&gt;
&lt;p&gt;Nevertheless, accepting that the simulated universe is real is philosophically troublesome, because where does this universe exist? It emphatically does not exist in the wires of your fifty-exaflop laptop, in its hard drives or transistors. Transistors and electrical signals are just one way to implement a Turing machine; you could use billiard balls or even just do the computations by hand with pen and paper. You could simulate using one form of a Turing machine up to some time T, output the simulated state onto some other medium, and resume the computation using another form of a Turing machine! The simulated people would not know the difference.&lt;/p&gt;
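The swap-machines-mid-computation point is easy to make concrete with a toy deterministic "universe": run it straight through on one implementation, or checkpoint halfway and resume on an independently written implementation, and the final state is bit-for-bit identical. The update rule below (a 64-bit linear congruential step) is chosen purely for illustration:

```python
def step_a(state):
    """One tick of a toy deterministic "universe" (a 64-bit LCG)."""
    return (6364136223846793005 * state + 1442695040888963407) % 2**64

def step_b(state):
    """The same update rule written differently: a stand-in for
    running the computation on a completely different kind of machine."""
    s = (state * 6364136223846793005) % 2**64
    return (s + 1442695040888963407) % 2**64

def run(state, steps, step):
    for _ in range(steps):
        state = step(state)
    return state

# Run 100 ticks straight through on "machine A"...
direct = run(12345, 100, step_a)
# ...or checkpoint at T=50 and resume on "machine B".
checkpoint = run(12345, 50, step_a)
resumed = run(checkpoint, 50, step_b)
assert direct == resumed  # the simulated history is identical
```

Nothing inside the simulated trajectory records which implementation produced which tick, which is the point of the argument above.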
&lt;p&gt;If you accept that this simulated universe exists, you must also accept the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The simulated universe exists in some platonic realm. There is a one-to-one correspondence between the numbers in your computation and the states of the objects in the simulated universe (variable x corresponds to some parameter of the wave function of this electron, etc.), sure, but the simulated universe does not exist “inside” of your universe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All that is required for existence in this platonic realm is a representation, which means some descriptors (the numbers in the computation) and a model for interpreting that representation (this number maps to this thing in the simulated universe). As an aside, the descriptors themselves can be instantiated through representations! There are no bits in your computer; there are electrical signals that represent 1 or 0 based on voltage. There are no bits in your hard drive; there are physical pieces of medium that represent 1 or 0 based on magnetization. There are no bits when you write down a number; there are physical marks on paper that represent 1 or 0 based on what they look like. In other words, the bit already lives in the platonic realm.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Questions for next time&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;If all that’s required for existence is the representation, is it even necessary to run the simulation? Could you just say, here are the equations my simulation would have solved (here is the algorithm my Turing machine would have run), here are the initial conditions, and now the simulated universe exists? If you’re in a deterministic universe, specifying the governing equations and initial conditions determines the state at every time point already–the solution exists, even if you don’t know what it is, so what extra value does doing the computation add? How does computing the solution (i.e., doing the simulation) make the simulated universe more real? Even if you’re in a non-deterministic universe, specifying the (probabilistic) governing equations and initial conditions determines every solution that is compatible with said equations and conditions (i.e., they determine every simulated outcome that could have taken place when you run the simulation), so what extra value does doing the computation add? You’re just moving from an implicit representation (the equations + initial conditions, or equivalently the algorithm + initial state) to an explicit one (the computed descriptors at each time step, as represented by bits as represented by whatever physical system your Turing machine runs on).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What happens when you make an error in the computation, due to e.g. hardware glitch or human error if doing it by hand?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Glossed over is the fact that what we have are not the exact governing laws of the universe (and we may never have them); all we have are models of varying degrees of accuracy, and additionally we may lose accuracy by choosing a coarse spatial or temporal resolution, a poor numerical algorithm, or too few sigfigs in our floating-point arithmetic. Does that matter?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;div class="section" id="appendix"&gt;
&lt;h2&gt;Appendix&lt;/h2&gt;
&lt;p&gt;I used simulating consciousness to motivate what I really wanted to talk about, which was simulating the entirety of existence and what it means to exist. Being able to simulate a universe is sufficient but not necessary for being able to simulate consciousness (again, simulation to me means that the simulated consciousness is “real”), and I think the former is strictly harder while the latter is almost not an interesting question at all. There are arguments such as this and this which supposedly show why simulating consciousness is not possible; to me these arguments lack imagination. “X is unintuitive” is not an argument against X.&lt;/p&gt;
&lt;p&gt;The fundamental question re: simulating consciousness is&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Is the neuron (a real one) the only structure that is capable of mediating (physically supporting) consciousness?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The answer is a resounding no, because (intelligent) aliens almost certainly exist (if not, then they almost certainly can exist, which is all that’s needed here), and aliens almost certainly do not have the same biological architecture as humans do.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>ai</category><category>consciousness</category><category>mind</category><category>simulation</category><guid>https://siweizhu.com/posts/is-simulation-possible/</guid><pubDate>Sat, 12 Oct 2013 07:00:00 GMT</pubDate></item></channel></rss>