# Implementing machine learning in C

time to implement machine learning in c

1. 6 months ago
Anonymous

no

2. 6 months ago
Anonymous

I tried doing it one night after drinking but i couldn't figure out error calculation and backtracking so i gave up and went to sleep

• 6 months ago
Anonymous

Read a book that will teach you those things, and explain the meaning of them.

• 6 months ago
Anonymous

>Read a book that will teach you those things, and explain the meaning of them.
I don't really have a background in math so those books make about as much sense to me as Latin
I am an engineer though, so i do a lot of Excel and numpy/pandas related stuff, it's just theoretical math that i have trouble with. If someone could show me a guide or video showing 2 hidden layers being backtracked with actual formulas and numbers, I'll be able to copy that.

• 6 months ago
Anonymous

>I don't really have a background in math
>I am an engineer though
Pardon?

• 6 months ago
Anonymous

>Pardon?
Anon.... It may come as a surprise to you, but even though we EEs study Maxwell's equations, advanced calculus and other mad math topics, we don't actually understand or use that stuff very often.
The most i use for my job is maybe highschool calculus and algebra, sometimes univ-level trigonometry.

• 6 months ago
Anonymous

Not even Fourier transforms? In any case check out this book, I'm pretty certain it was written by an electrical engineer (going from memory). I don't think the math requirements should include anything you haven't seen. Taking a quick flick through the pages, the most "advanced" thing I saw was a gradient.

• 6 months ago
Anonymous

>engineer
>doesnt know math
Choose one!

• 6 months ago
Anonymous

To start off with, the chain rule of derivatives is important: if we want to find the gradient of a weight with respect to the loss we get at the end of a forward pass, we can find it by multiplying the gradients in between, for example in the image (DL/DW = DL/DA * DA/DS * DS/DW)
this means that we can step backwards, calculating each gradient as we go.

to find the gradient of the weight we can start by finding the gradient of the Loss with respect to the Activation(s), which (for MSE loss) is (2/N)*(Activation-Target), where N is the number of output units.
The gradient of the activation with respect to the unit input (for ReLU) is: IF x > 0 : 1, ELSE 0
The gradient of the unit input with respect to the Weight is Activation(prev), where Activation(prev) is either the output of the unit at the start of the weight, or in this case the X input.

putting this together we can find the gradient of the weight WRT the Loss.
DL/DW = DL/DA * DA/DS * DS/DW
DL/DW = 2*(Activation-Target) * 1 * X
DL/DW = 2*(0.25-0.6) * 1 * 0.5 = -0.35
Then you can adjust the weights using Weight = Weight - LearningRate * DL/DW

outside of trivial networks like this, the DL/DA is actually: SUM over J (DL/DInput(J) * Weight(This -> J)), where J is the set of units that take in the output of the activation.

If you dont understand anything. :shrug:

• 6 months ago
Anonymous

>ReLU
Into the trash it goes

• 6 months ago
Anonymous

then use another activation. doesn't change anything.

• 6 months ago
Anonymous

its a mongo example, i cant be assed writing the derivative of anything more complicated RN.
also, I LOVE SPARSE GRADIENTS.

https://i.imgur.com/b4jUbHx.png

I've attached a less schitzo and more readable version of the equations for backprop in general.

• 6 months ago
Anonymous

Thanks for the explanation

• 6 months ago
Anonymous

>backtracking
backpropagation

• 6 months ago
Anonymous

>backpropagation
Just use a genetic algorithm lol. It's only like 3 times slower than gradient-descent backprop but at least it does not get stuck in a local minimum all the time. If you do some architecture frickery you can even prevent overfitting

• 6 months ago
Anonymous

>does not get stuck in a local minimum all the time
Hyperparameters issue

• 6 months ago
Anonymous

there's no guarantee you reach the global minimum given any hyperparameters

• 6 months ago
Anonymous

Yes and?

• 6 months ago
Anonymous

which means you stuck at local minimum all the time

• 6 months ago
Anonymous

>which means you stuck at local minimum all the time
Not reaching the global minimum with one set of hyperparameters doesn't mean you can't reach it with another set.

• 6 months ago
Anonymous

then it would be a luck issue, not a hyperparameter issue

• 6 months ago
Anonymous

SGD is theoretically and practically superior. It turns out SGD is the best known approximation algorithm for ERM learning, which is NP-hard.

3. 6 months ago
Anonymous

not hard, like 20 lines of code, been done 40 years ago. and aside from some attention layer extras and moving buffers to the GPU, it's more or less still the same thing powering LLMs today

4. 6 months ago
Anonymous

>makes it run at kernel level
What happens?

• 6 months ago
Anonymous

Judgement day.

5. 6 months ago
Anonymous

the final redpill is to do a forward pass, calculate the error, then pick a random neuron and change it to see if error goes down or not. repeat until you reach the desired performance

• 6 months ago
Anonymous

you just went full moron

6. 6 months ago
Anonymous

99% of the effort is in autograd. Good luck.

7. 6 months ago
Anonymous

Would there be a massive improvement in training efficiency if training LLMs were done in C as opposed to Python, which is what is commonly used in the research realm?

• 6 months ago
Anonymous

no, the training algorithms are already written in cuda. python is just a wrapper for it all

• 6 months ago
Anonymous

Not really, the computationally costly parts are already done in C
If you want an easier way to speed up common code, write the algos in Julia

• 6 months ago
Anonymous

>Not really, the computationally costly parts are already done in C
lies.
C has no auto-vectorization and is too slow.
for linear algebra, the computationally costly stuff, you would use BLAS in C. BLAS is written in Fortran, which actually has auto-vectorization.

• 6 months ago
Anonymous

>BLAS is written in Fortran
*was. It's all C++ now. Boomers who know Fortran are dying faster than we can replace them, fact. We will rewrite BLAS in Rust in the next few decades at this rate.

• 6 months ago
Anonymous

Except rust is dying already.

• 6 months ago
Anonymous

Microsoft has literally given Rust billions of dollars.

• 6 months ago
Anonymous

Wait what, why?

• 6 months ago
Anonymous

llama doesn't use relu
history: relu was not used originally because it didn't model a biological neuron correctly. alexnet showed that relu allowed model training to be several times more efficient, so relu has been used ever since.
apparently the newest networks don't even use activation functions at all.

• 6 months ago
Anonymous

unless they've got something really fricky they still need an activation function of some sort to prevent linearity (without one, stacked layers collapse into a single linear map).
Obvs i dont have my ear to the ground on that type of NNs, but maybe it's just a different type of ML

• 6 months ago
Anonymous

yeah it sounded weird when i read it. they are called state space models. they sucked until last week, when a paper by a single author improved them with a few obvious-in-hindsight tricks to be a gajillion percent better than transformers... at like just a few parameters though, no one has scaled them up yet, at least not the newer variants.

• 6 months ago
Anonymous

holy cow, swype typing is quite inaccurate

• 6 months ago
Anonymous

hey man, stop drinking beer, ok? that's a good boy.

• 6 months ago
Anonymous

nta and it wasn't billions, it was millions. The reason is that the Microsoft CTO is a Rust fanboy; also, like 70% of their security vulnerabilities are caused by bugs that are not possible in Rust

• 6 months ago
Anonymous

there's no way you would write GEMM in C or C++. it's either Fortran or hand-tuned assembly.

• 6 months ago
Anonymous

>he thinks people haven't optimized AI in every possible way

>yeah my random saturday idea would help the world advance.
>why is the world so dumb and I'm so good

• 6 months ago
Anonymous

He's just asking a question you Black person, no need to create a fake scenario in your head

• 6 months ago
Anonymous

asking an obviously stupid question is called baiting

• 6 months ago
Anonymous

Black person, the heavy lifting is done in C++, the king of performance. Python is just the interface. If you did them in C it would literally be a downgrade.

8. 6 months ago
Anonymous

I wrote a neural network without a framework in dartlang in undergrad and php right out of college.

9. 6 months ago
Anonymous

What cool project can you do with only a little neural network?
Like if you do it in C using only a few gigs of ram and no GPU.

• 6 months ago
Anonymous

bert, but for things other than language
of course then you have to use its output vector for something

• 6 months ago
Anonymous

>bert, but for things other than language
like what? can you even use a language model for anything other than language generation stuff?

• 6 months ago
Anonymous

anything sequential, not just time-series... hmm... how about classifying integer sequences? Algorithms for calculating the nth digit of Pi are extremely common and very diverse. Can bert auto-learn this? What about e? if you feed 10 digits of pi to bert, will it detect that the sequence is pi-like or e-like?

if you look up recent state space model papers, those models have modes where they can be tuned from sequence-favoring (transformer-like, quadratic comparison complexity) to series-favoring (special ssm sauce, either linear or log complexity). you can imagine making a tiny bert with state space models configured to be transformer-like and then tweaking it from there.

as an aside, berts do not normally do language generation, but recently they have been popular for that purpose. normally they are envisioned as encoders of text or sentences.

i wonder why no one embeds berts in gpts to do coarse decision making.

• 6 months ago
Anonymous

frick, damnit, why does swype typing never work! okay anon i hope you can decrypt what i just wrote....

10. 6 months ago
Anonymous

Programming could have been so good if only the right systems, paradigms and concepts were implemented earlier on.

The speed of programming could have been 100x faster than the current pace, I am so upset that every single person has got it completely wrong. Far out, far out, we've done everything wrong. If you could see what I can see in my own programs...

• 6 months ago
Anonymous

Please do tell anon, what programming techniques and paradigms have you been using?

• 6 months ago
Anonymous

rust obviously

• 6 months ago
Anonymous

why don't you opine about x86 while you're at it

• 6 months ago
Anonymous

>he isn't writing ISO C
why does anyone get even mad about this. duh you can't afford the spec and you're not gonna read it

• 6 months ago
Anonymous

>Programming could have been so good if only the right systems, paradigms and concepts were implemented earlier on.
This.
The correct paradigm is (and always was) seething and dilating.
We were young and naive, experimenting with new patterns, new languages
But nothing really improved until finally we seethed and got a CoC in our repository.
And dilating was what opened us up, allowing us to be receptive to these new paradigms.

11. 6 months ago
Anonymous

what for? shit get offloaded to GPU anyway. host code literally doesn't matter. python is good enough.

12. 6 months ago
Anonymous

>time to implement machine learning in c

Except that "neural networks" are based on an untested and unproven theory of how the human brain works, and on the presumption that the brain is actually a computer and not really a kind of transceiver that enables us to interact with the aether.

Just like people stupidly and blindly accept the farce of gravity, in much the same way they think that when we think thoughts, our brains are "computing" this... but no, that's really not what is happening at all.

• 6 months ago
Anonymous

>our brains are "computing" this...but no, that's really not what is happening at all.
And what do you think really happens?
Perhaps this is where you should go: >>>/x/

• 6 months ago
Anonymous

frick off Penrose, nobody is buying your book

13. 6 months ago
Anonymous