no
I tried doing it one night after drinking, but I couldn't figure out the error calculation and backtracking, so I gave up and went to sleep
Read a book that will teach you those things, and explain the meaning of them.
>Read a book that will teach you those things, and explain the meaning of them.
I don't really have a background in math, so those books make about as much sense to me as Latin
I am an engineer though, so I do a lot of Excel and numpy/pandas-related stuff; it's just the theoretical math that I have trouble with. If someone could show me a guide or video showing 2 hidden layers being backtracked with actual formulas and numbers, I'd be able to copy that.
>I don't really have a background in math
>I am an engineer though
Pardon?
>Pardon?
Anon.... It may come as a surprise to you, but even though we EEs study Maxwell's equations, advanced calculus, and other mad math topics, we don't actually understand or use that stuff very often.
The most I use for my job is maybe high school calculus and algebra, sometimes university-level trigonometry.
Not even Fourier transforms? In any case, check out this book; I'm pretty certain it was written by an electrical engineer (going from memory). I don't think the math requirements include anything you haven't seen. Taking a quick flick through the pages, the most "advanced" thing I saw was a gradient.
>engineer
>doesnt know math
Choose one!
To start off with, the chain rule of derivatives is important: if we want to find the gradient of a weight with respect to the loss we get at the end of a forward pass, we can find it by multiplying the gradients in between, for example in the image (dL/dW = dL/dA * dA/dS * dS/dW).
This means we can step backwards, calculating each gradient as we go.
To find the gradient of the weight, we start with the gradient of the Loss with respect to the Activation, which (for MSE loss) is (2/N) * (Activation - Target), where N is the number of output units.
The gradient of the Activation with respect to the unit input (for ReLU) is: IF x > 0 : 1, ELSE 0.
The gradient of the unit input with respect to the Weight is Activation(prev), where Activation(prev) is either the output of the unit at the start of the weight, or in this case the X input.
Putting this together we can find the gradient of the Weight WRT the Loss:
dL/dW = dL/dA * dA/dS * dS/dW
dL/dW = 2*(Activation - Target) * 1 * X
dL/dW = 2*(0.25 - 0.6) * 1 * 0.5 = -0.35
Then you can adjust the weight using Weight = Weight - LearningRate * dL/dW.
Outside of trivial networks like this, dL/dA is actually SUM over j of (dL/dS_j * Weight(this -> j)), where the sum runs over the units j that take the output of this activation as input.
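Since you said you do numpy/pandas stuff anyway, here is the same worked example as a few lines of plain Python. This is only a minimal sketch: the weight value 0.5 is my assumption, inferred from the numbers above (0.5 * 0.5 = 0.25 = Activation), and the learning rate 0.1 is made up.

x, w, target, lr = 0.5, 0.5, 0.6, 0.1

# forward pass
s = w * x                       # unit input (pre-activation), 0.25
a = max(0.0, s)                 # ReLU activation, 0.25
loss = (target - a) ** 2        # MSE with a single output unit, 0.1225

# backward pass: dL/dW = dL/dA * dA/dS * dS/dW
dL_dA = 2 * (a - target)        # -0.7
dA_dS = 1.0 if s > 0 else 0.0   # ReLU gradient, here 1
dS_dW = x                       # gradient of s = w*x with respect to w, 0.5
dL_dW = dL_dA * dA_dS * dS_dW   # -0.35

# gradient descent step
w = w - lr * dL_dW              # 0.5 - 0.1 * (-0.35) = 0.535
print(loss, dL_dW, w)

One pass of this moves the weight from 0.5 to 0.535, nudging the activation toward the 0.6 target, so the signs work out.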
If you dont understand anything. :shrug:
>ReLU
Into the trash it goes
then use another activation. doesn't change anything.
It's a mongo example; I can't be assed writing the derivative of anything more complicated RN.
also, I LOVE SPARSE GRADIENTS.
I've attached a less schizo and more readable version of the equations for backprop in general: https://i.imgur.com/b4jUbHx.png
Thanks for the explanation
>backtracking
backpropagation
>backpropagation
Just use a genetic algorithm lol. It's only like 3 times slower than gradient descent backprop, but at least it does not get stuck in a local minimum all the time. If you do some architecture fuckery you can even prevent overfitting.
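For the curious, a bare-bones version of that idea looks roughly like this. It's only a sketch: the toy data, population size, mutation scale, and selection scheme are all made up for illustration, not taken from any particular paper.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                                  # toy inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)        # toy binary targets

def loss(w):
    # tiny "network": one linear unit + sigmoid, MSE loss
    pred = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.mean((pred - y) ** 2)

pop_size, n_parents, sigma = 50, 10, 0.1
pop = rng.normal(size=(pop_size, 3))                          # population of weight vectors

for gen in range(200):
    fitness = np.array([loss(w) for w in pop])
    parents = pop[np.argsort(fitness)[:n_parents]]            # keep the best individuals
    idx = rng.integers(0, n_parents, size=(pop_size, 2))
    children = (parents[idx[:, 0]] + parents[idx[:, 1]]) / 2  # crossover: average two parents
    pop = children + sigma * rng.normal(size=children.shape)  # mutation: add Gaussian noise
    pop[0] = parents[0]                                       # elitism: carry the best over unchanged

print(loss(pop[0]))                                           # should be well below the initial loss

Selection plus mutation does the same job as the gradient step, just without needing any derivatives.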
>does not get stuck in a local minimum all the time
Hyperparameters issue
there's no guarantee you reach the global minimum given any set of hyperparameters
Yes and?
which means you get stuck at a local minimum all the time
>which means you get stuck at a local minimum all the time
Not reaching the global minimum with a given set of hyperparameters doesn't mean you're never reaching it with any set of hyperparameters.
then it would be a luck issue, not a hyperparameter issue
SGD is theoretically and practically superior. It turns out SGD is the best approximation algorithm for ERM learning, which is NP-hard.
Not hard, like 20 lines of code; it was done 40 years ago. And aside from some attention layer extras and moving buffers to the GPU, it's more or less still the same thing powering LLMs today.
>makes it run at kernel level
What happens?
Judgement day.
The final redpill is to do a forward pass, calculate the error, then pick a random neuron and change it to see if the error goes down or not. Repeat until you reach the desired performance.
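That's basically random hill climbing on the weights. A minimal sketch of it for a single linear unit, with toy data and a made-up 0.1 perturbation size:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])            # toy regression targets (exactly linear)

def error(w):
    return np.mean((X @ w - y) ** 2)          # forward pass + error

w = np.zeros(3)
best = error(w)
for _ in range(200_000):
    if best < 1e-4:                           # good enough for a demo
        break
    i = rng.integers(len(w))                  # pick a random weight ("neuron")
    old = w[i]
    w[i] += 0.1 * rng.normal()                # change it a little
    new = error(w)
    if new < best:
        best = new                            # keep the change if the error went down
    else:
        w[i] = old                            # otherwise undo it

print(best, w)                                # w ends up roughly [1, -2, 0.5]

It works on toy problems; the catch is that the number of trials needed blows up as the number of weights grows.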
you just went full retard
99% of the effort is in autograd. Good luck.
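To give an idea of what that involves, the core of a toy scalar reverse-mode autograd looks roughly like this (nothing like the real machinery in torch, just the principle: record the graph on the forward pass, then walk it backwards applying the chain rule):

class Value:
    # a scalar that remembers how it was computed, so gradients can flow back through it
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule node by node, output first
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# the same toy example as above: loss = (relu(w*x) - target)^2
x, w, target = Value(0.5), Value(0.5), Value(0.6)
a = (w * x).relu()
d = a + target * Value(-1.0)     # no __sub__ defined, so subtract via * -1
loss = d * d
loss.backward()
print(loss.data, w.grad)         # ~0.1225 and ~-0.35, matching the hand calculation

The real work in a framework is doing this for tensors, thousands of ops, and GPU kernels, which is where the "99%" goes.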
Would there be massive improvement in training efficiency if training LLMs are done in C as opposed to python which is what is commonly done in the research realm?
no, the training algorithms are already written in cuda. python is just a wrapper for it all
Not really, the computationally costly parts are already done in C
If you want an easier way to speed up common code, write the algos in Julia
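You can see the point from Python itself. A rough timing sketch comparing a pure-Python triple loop against numpy's matmul, which dispatches to whatever compiled BLAS numpy was built with; the exact numbers depend on your machine and BLAS build, but the gap is orders of magnitude:

import time
import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# matrix multiply as a pure-Python triple loop
t0 = time.perf_counter()
C = [[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(n)] for i in range(n)]
t1 = time.perf_counter()

# numpy's matmul, which calls into the compiled BLAS numpy was built against
D = A @ B
t2 = time.perf_counter()

print(f"pure python: {t1 - t0:.3f}s   BLAS via numpy: {t2 - t1:.6f}s")
print(np.allclose(C, D))    # same result, wildly different speed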
>Not really, the computationally costly parts are already done in C
lies.
C has no auto-vectorization and is too slow.
for linear algebra, the computationally costly stuff, you would use BLAS in C. BLAS is written in Fortran which actually has auto-vectorization.
>BLAS is written in Fortran
*was. It's all C++ now. Boomers who know Fortran are dying faster than we can replace them with C++. Fact. We will rewrite BLAS in Rust in the next few decades at this rate.
Except rust is dying already.
Microsoft has literally given Rust billions of dollars.
Wait what, why?
llama doesn't use ReLU
History: ReLU was not used because it didn't model a biological neuron correctly. AlexNet showed that ReLU allowed model training to be several times more efficient, so ReLU has been used ever since.
apparently the newest networks don't even use activation functions at all.
Unless they've got something really fucky, they still need an activation function of some sort to prevent linearity, otherwise the stacked layers collapse into a single linear map.
Obvs I don't have my ear to the ground on that type of NNs, but maybe it's just a different type of ML.
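The linearity point in a few lines of numpy, as a toy check with random matrices:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))    # "layer 1" weights
W2 = rng.normal(size=(3, 8))    # "layer 2" weights
x = rng.normal(size=4)

two_layers = W2 @ (W1 @ x)      # two layers with no activation in between...
one_layer = (W2 @ W1) @ x       # ...are exactly one layer with weights W2 @ W1
print(np.allclose(two_layers, one_layer))    # True

Any elementwise nonlinearity (ReLU, sigmoid, whatever) breaks that equivalence.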
Yeah, it sounded weird when I read it. They are called state space models. They sucked until last week, when a paper by a single author improved them with a few obvious-in-hindsight tricks to be a gajillion percent better than transformers... at like just a few parameters though, no one has scaled them up yet, at least not the newer variants.
holy cow, swype typing is quite inaccurate
hey man, stop drinking beer, ok? that's a good boy.
nta, and it wasn't billions, it was millions. The reason is that the Microsoft CTO is a Rust fanboy; also, around 70% of their security vulnerabilities are caused by bugs that are not possible in Rust.
there's no way you would write GEMM in C or C++. it's either Fortran or hand-tuned assembly.
>he thinks people haven't optimized AI in every possible way
>yeah my random saturday idea would help the world advance.
>why is the world so dumb and I'm so good
He's just asking a question you nagger, no need to create a fake scenario in your head
asking an obviously stupid question is called baiting
nagger, the heavy lifting is done in C++, the king of performance. Python is just the interface. If you did it in C it would literally be a downgrade.
I wrote a neural network without a framework in dartlang in undergrad and php right out of college.
What cool project can you do with only a little neural network?
Like if you do it in C using only a few gigs of ram and no GPU.
bert, but for things other than language
of course then you have to use its output vector for something
>bert, but for things other than language
like what? can you even use a language model for anything other than language generation stuff?
Anything sequential, not necessarily time-series... hmm... how about classifying integer sequences? Algorithms for calculating the nth digit of pi are extremely common and very diverse. Can bert auto-learn this? What about e? If you feed 10 digits of pi to bert, will it detect that the sequence is pi-like or e-like?
If you look up recent state space model papers, those models have modes where they can be tuned to behave anywhere from sequence-favoring (transformer-like, quadratic comparison complexity) to series-favoring (special ssm sauce, either linear or log complexity). You can imagine making a tiny bert out of state space models configured to be transformer-like and then tweaking it from there.
As an aside, berts do not normally do language generation, but recently they have been popular for that purpose. Normally they are envisioned as encoders of text or sentences.
I wonder why no one embeds berts in gpts to do coarse decision making.
fuck, damnit, why does swype typing never work! okay anon i hope you can decrypt what i just wrote....
Programming could have been so good if only the right systems, paradigms and concepts were implemented earlier on.
The speed of programming could have been 100x faster than the current pace. I am so upset that every single person has got it completely wrong. Far out, far out, we've done everything wrong. If you could see what I can see in my own programs...
Please do tell anon, what programming techniques and paradigms have you been using?
rust obviously
why don't you opine about x86 while you're at it
>he isn't writing ISO C
why does anyone even get mad about this? duh, you can't afford the spec and you're not gonna read it
>Programming could have been so good if only the right systems, paradigms and concepts were implemented earlier on.
This.
The correct paradigm is (and always was) seething and dilating.
We were young and naive, experimenting with new patterns, new languages
But nothing really improved until finally we seethed and got a CoC in our repository.
And dilating was what opened us up, allowing us to be receptive to these new paradigms.
What for? Shit gets offloaded to the GPU anyway. Host code literally doesn't matter. Python is good enough.
>time to implement machine learning in c
Except that "neural networks" are based on an untested and unproven theory of how the human brain works, and on the presumption that the brain is actually a computer and not really a kind of transceiver that enables us to interact with the aether.
Just like people stupidly and blindly accept the farce of gravity, they much the same way think that when we think thoughts, our brains are "computing" this...but no, that's really not what is happening at all.
>our brains are "computing" this...but no, that's really not what is happening at all.
And what do you think really happens?
Perhaps this is where you should go: >>>/x/
fuck off Penrose, nobody is buying your book
>2002 AD
>be me
>code NN library in C++ for fun before it was cool
fuck y'all normie posers
One of the most used neural network frameworks is written in C: https://github.com/pjreddie/darknet