Why has nobody made a decompiler that takes machine code and transforms it into a human-readable Visual Studio source code project?
I'm sure the AI could guess a good name for everything inside the source code. And organize everything into a good architecture.
Decompiling is a niche practice that mostly only hobbyists and the occasional unfortunate legacy-code support guru concern themselves with. While such a tool would certainly be useful to them, you'll never get production-quality source code out of it, so it wouldn't usher in a new age of reverse engineers.
Like how it was stated that an AI could never beat a man at chess? I think OP is right on the money that an AI could make an excellent decompiler. Saying it wouldn't be valuable is flippant and just wrong. Multinational corps and governments are working on this in secret as we speak.
Absolutely. Imagine when AI gets to the point it can spit out any program you want, completely finished. It got to that point with images and it will get to that point with software too. It's just a matter of time.
Except when you don't even know what you want or don't want, which is the case with most non-programmers. Then the AI prompters will just be using the AI as a compiler for 'natural' language instead of a programming language tailored for the purpose.
Not even remotely the same.
Image generation is only asking an AI to make something that looks vaguely like what you described, usually with mangled fingers.
Decompilation requires the AI to generate an extremely precise output that must be logically equivalent to the binary.
>logically equivalent to the binary.
yes this is the difficult part, computers are notoriously bad at both logic and determining equivalency
By that logic, P=NP, because it's all just logic anyways, right?
>What is the halting problem
Do you even know how a decompiler works? They don't trace one branch in the code or the other; they trace both branches, exactly once. The binary would have to be infinite in size for the halting problem to be relevant.
Fuck. When you led me to reply with this I got the sense this point is fundamentally important, but I'm not sure how. Decompiling finite-sized binaries will always halt. That might weakly sidestep the issue of whether an arbitrary program will halt or not. Still not decidable, but maybe it doesn't need to be decided in some situations?
Missed the point.
The halting problem is a direct refutation of the argument that computers can solve any logical problem.
At any rate:
>Do you even know how a decompiler works?
>They don't trace one branch in code or the other they trace both branches
Depends on whether it's a recursive descent or a linear sweep type.
Linear sweep goes through the code from the first address to the last in one pass, which is simpler, but it usually misses a lot of details when there's no DWARF symbol table to help it.
Recursive descent is better at finding all the code, but it can and does struggle with code that has particularly knotted or messy control flow.
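The difference is easy to sketch on a toy instruction set (everything below is hypothetical, purely to illustrate the two strategies, not any real ISA):

```javascript
// Toy 2-byte instruction set (hypothetical):
// [0x01, _] = nop, [0x02, t] = jmp t (absolute byte offset), [0x03, _] = ret.
const NAMES = { 0x01: 'nop', 0x02: 'jmp', 0x03: 'ret' };

// Linear sweep: decode every 2-byte slot front to back in one pass.
// Simple, but it happily "decodes" embedded data as instructions.
function linearSweep(code) {
  const out = [];
  for (let pc = 0; pc + 1 < code.length; pc += 2) {
    out.push([pc, NAMES[code[pc]] ?? 'db ??']);
  }
  return out;
}

// Recursive descent: start at the entry point and follow control flow,
// decoding each reachable instruction exactly once.
function recursiveDescent(code, entry = 0) {
  const out = new Map();
  const work = [entry];
  while (work.length) {
    const pc = work.pop();
    if (pc < 0 || pc + 1 >= code.length || out.has(pc)) continue;
    const op = code[pc];
    out.set(pc, NAMES[op] ?? 'db ??');
    if (op === 0x02) work.push(code[pc + 1]); // follow the jump target
    else if (op !== 0x03) work.push(pc + 2);  // fall through (ret ends flow)
  }
  return out;
}

// jmp 4 skips over two embedded data bytes (0xde, 0xad) to nop; ret.
const code = [0x02, 0x04, 0xde, 0xad, 0x01, 0x00, 0x03, 0x00];
// linearSweep misreads offset 2 as an instruction; recursiveDescent skips it.
```

Real recursive-descent disassemblers hit trouble exactly where this toy one would: indirect jumps whose targets can't be read straight off the instruction.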
>The halting problem is a direct refutation of the argument that computers can solve any logical problem.
In general and for illustrative purposes, yeah. Wanting to know whether a particular program will halt is a more practical question. But maybe knowing that it can halt is still useful?
"Beating a human at a game of looking N moves into the future" is not the same as literally reversing entropy. AI can't decompile machine code into production-quality code because it's literally mathematically impossible. You're seriously asking AI to turn lead into gold here.
Disassemblers from machine code to assembly language already exist and can produce byte-for-byte identical output when reassembled.
it's a one-to-one mapping
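That lookup flavor is easy to sketch (a toy table of a few real single-byte x86 opcodes; a real disassembler's tables are vastly bigger, and multi-byte instructions muddy the one-to-one claim):

```javascript
// For single-byte opcodes, decoding is a direct table lookup, and the
// reverse table turns mnemonics straight back into the original bytes.
const DECODE = { 0x90: 'nop', 0xc3: 'ret', 0xf4: 'hlt' };
const ENCODE = Object.fromEntries(
  Object.entries(DECODE).map(([byte, name]) => [name, Number(byte)])
);

const disassemble = bytes => bytes.map(b => DECODE[b]);
const assemble = mnemonics => mnemonics.map(m => ENCODE[m]);

// Round-tripping recovers the original bytes exactly:
const original = [0x90, 0x90, 0xc3];
const roundTrip = assemble(disassemble(original)); // [0x90, 0x90, 0xc3]
```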
>because it's literally mathematically impossible
You're retarded. It's a deterministic process of finitely many pieces. How is such a process "mathematically impossible" to reverse?
I agree. That anon is coping massively
It's a many-to-one relationship. It's unrecoverable.
Decompiling machine code into "code which would compile into that machine code" exists. Giving the variables meaningful names is not a "byte-for-byte" process; it's a generative process, and recovering the originals is by any measure impossible.
It doesn't need to be identical to the original source code, comments and all for it to be useful. Just high level language code that produces the same output.
If it were impossible for the exact meaning of the compiled output to be decided, it could not be executed by a computer.
There is no single one-to-one mapping of a compiled binary to source code. For a simple example, suppose there is a binary blob containing an image; if you know the stride, pixel format, start, and end of the image in the blob, it's possible to extract the image, but without that, you're reduced to scanning through all possible combinations searching for an image. Even if you find something that looks like an image, there's fundamentally no way to know if it's the "correct" one. This is much worse for a machine-code-to-source-code conversion.
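A quick sketch of the stride problem (hypothetical 1-byte-per-pixel blob, just to show the ambiguity):

```javascript
// The same 12 bytes decode to completely different "images" depending on
// which row stride you guess; nothing in the bytes says which is right.
const blob = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12];

// Read pixel (x, y) under an assumed row stride in bytes.
const pixel = (buf, stride, x, y) => buf[y * stride + x];

// Same blob, same coordinate, two stride guesses, two different pixels:
const asRowsOf4 = pixel(blob, 4, 0, 1); // 5 (rows of 4 bytes)
const asRowsOf3 = pixel(blob, 3, 0, 1); // 4 (rows of 3 bytes)
```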
True. An AI would have to make guesses based on heuristics, the same as people would when doing the same task. The only difference would be the AI could make guesses much quicker.
Not to mention, there's no one-to-one mapping of source code to binary either, otherwise compiler flags wouldn't be a thing.
Also, binaries usually have a few weird sections of code that break the conventions and idioms used elsewhere in the code, usually as a result of some hand written assembly that got pulled in through a library, or a .dll/.so that was generated by a different compiler than the rest of the program.
The compiler flags are effectively just part of the source code. So could be predicted by an AI just the same.
Way to miss the point.
The point is that source code to binary is a many-to-many mapping, and that makes formal verification that any two source codes/binaries are equivalent a very difficult task.
If the output binaries are the same byte for byte, then the source codes used to make them are equivalent.
Good luck getting byte-identical binaries.
That level of reproducibility almost always requires using the same compiler, the same exact version, and the same exact flags.
Most binaries have that information readily available by just looking at its contents in a hex editor.
plus, it doesn't take much to run all the compilers on the source code in parallel
"Most". Maybe PC software does, but I can assure you there is shitloads of software/firmware that does not.
You sure about that?
I'm seeing 200+ release versions of GCC, 77 release versions of LLVM, and who knows how many releases of MSVC and Intel C++ Compiler, not to mention less popular compilers like TCC or once popular compilers like Borland.
Then you have to multiply that by the number of flags available for those compilers.
Basically everything compiled with gcc is freetard shit with source code already available
Those are the three that are actually important, though ICC is probably much less important than MSVC and Clang.
TCC and Borland have compiled a lot of legacy software. There are many pieces of software that are too good to be replaced, even though the company went bust and all the programmers are senile.
Borland maybe, but TCC is just a toy compiler that no one uses for serious projects.
For obsolete software it will be cheaper to buy the source code from the company than to develop an AI decompiler.
I work at a place that uses some DOS software and we don't even know how to contact the people who wrote it. Currently we have to buy special motherboards designed for old PC software
If there are 10000 potential toolchains, there will still be heuristics that can get you to a manageable set to test against. I know LLVM being used will be identifiable simply by looking at how registers are allocated.
Also the fact that LLVM never emits an "add" instruction with a constant, only a "sub" instruction with a negative constant. It's a bit silly really.
>ecx = ecx + 1
>sub ecx, 0xffffffff
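Which works out, because in 32-bit two's complement 0xffffffff is -1, so subtracting it is the same as adding 1. A quick check using JS's unsigned 32-bit coercion:

```javascript
// Emulate 32-bit register arithmetic with >>> 0 (coerce to uint32).
const sub32 = (a, b) => (a - b) >>> 0;
const add32 = (a, b) => (a + b) >>> 0;

const incremented = sub32(41, 0xffffffff);     // 42, same as add32(41, 1)
const wrapped = sub32(0xffffffff, 0xffffffff); // 0, matching add32 overflow
```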
I think him and you misunderstood the implications. Translation from binary to a source code would imply an algorithm that understand exactly what a program does.
This allows the algorithm to know exactly if the program halts, and that lead to a contraddiction
>This allows the algorithm to know exactly if the program halts
No, knowing what a program does doesn't tell you enough to know if it will halt.
Knowing what a program does would encompass also knowing if it halts.
If you don't know if the program halts, you don't fully know what the program does.
You can't tell if a program halts just by looking at source code either.
Yes you can. You can't do it algorithmically
Think of it that way and you can't know what any program that depends on user input does, even if you wrote it. A program you might write that halts when the user presses 'q' on the keyboard might never halt if the user never presses that key. Would it then be fair to say you don't know what the program you wrote does?
> A program you might write that halts when the user presses q
Why such thing should happen? Did you take any course on computability theory?
And yes, a computer can't know what your program does for the same reason of the halting problem
There's that coping again. Flip flopping between the concrete and abstract won't change that this is possible and will soon be carried out, if it hasn't been already. By the way, your grammar is interesting. What's your native language?
>Theory be damned, just throw more machine learning and compute cycles at it!
Lol, classic pop-sci AI-tard cope.
Avoiding that question huh? Russian is it?
There more than one person who thinks you are a retard anon
But we are right to think so. It's CS101
No you aren't. You aren't even following the discussion.
You don't seem to be either.
Every time you're proven wrong you fall back to an already disproven argument.
The halting problem doesn't state that you never tell when a program halts, it just states that there's no algorithmic solution to determine if any arbitrary program halts.
Therefore, a proof by contradiction doesn't apply here.
Looking at it another way, one could write out the formal logic describing the conditions under which the program halts, which fully describes the halting behavior of the program.
Solving the halting problem in this case is 'simply' a matter of solving the boolean satisfiability problem, which is a NP-complete problem.
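As a sketch of why that blows up, here's the naive equivalence check; it has to walk all 2^n assignments:

```javascript
// Brute-force check that two boolean formulas agree on every assignment.
// n variables means 2^n cases: fine at n = 2, hopeless at n = 64.
function equivalent(f, g, n) {
  for (let bits = 0; bits < (1 << n); bits++) {
    // Unpack the counter's bits into an array of n booleans.
    const v = Array.from({ length: n }, (_, i) => Boolean(bits & (1 << i)));
    if (f(v) !== g(v)) return false; // found a distinguishing assignment
  }
  return true;
}

// Two different-looking but equivalent formulas (De Morgan's law):
const f = ([a, b]) => !(a && b);
const g = ([a, b]) => !a || !b;
// equivalent(f, g, 2) is true after checking all 4 assignments
```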
I don't dispute the existence of the halting problem. I'm just saying an AI producing source code from an output binary isn't equivalent to it.
It's pretty close when you consider the fact that you need to make sure that both are logically equivalent.
Hence the start of this stupid debate:
>yes this is the difficult part, computers are notoriously bad at both logic and determining equivalency
Which was then refuted by contradiction via the Halting Problem, a logical problem which cannot be solved by computers.
It might take a long time, but there are a finite number of toolchains to check, with likely only a small number of potential candidates. Iterating sequentially over a list, you'll definitely come to the end of it.
That assumes you have identical source code (excluding variable names or whitespace).
Besides, I've seen disassembly projects by people that produce byte-for-byte identical output. A program that outputs the target bytes from an array into a file will do so even if the compiler isn't the same.
>A program that outputs the target bytes from an array into a file will do so even if the compiler isn't the same.
And you don't need AI for such trivial cases.
The proof is in the complex examples, which is where formal equivalence is most important.
Also, I'm not sure why you're so hung up on byte-equivalence.
I'm talking about the broader problem of just checking if two different source codes/binaries are logically equivalent, since it's much easier to make two logically equivalent programs than it is to try and force a compiler to spit out the same binary again, given that source code to binaries is a many-to-many mapping.
>source code to binaries is a many-to-many mapping.
Yes. So there's many potential source codes that produce the same output.
If a machine cannot decide whether two pieces of code are semantically the same, you also can't build an algorithm that can produce source code for any binary
In general that's true, but if they are byte for byte the same they must behave identically.
Fuck sake, we've been over this: getting byte-identical outputs is the exception for decompilation projects, not the norm.
But it is always possible with due care. More broadly, do you think a set of bytes exists that a copy cannot be made of? It is fundamentally equivalent to that.
But it's still impossible to guarantee that for any binary
Please open a cs book
Two copies of any series of bytes are the same, for any series of bytes. It is you who is retarded.
But it's almost impossible to get two identical binaries from every different but semantically equivalent code, you cretin
An AI that could do this would become a compiler in its own right anyway, translating from compiled code to source code rather than the other way around. Ever heard of Turing completeness?
At this point you must be trolling I refuse to believe you're so clueless and stubborn
I'm not trolling. But you seem clueless and stubborn to me. So let's just agree to differ.
>clueless and stubborn.
That would be you, refusing to understand even basic theory of computer science.
What exactly do you believe I don't understand?
The halting problem?
All of it, clearly, because you're not following the debate and are fixating on one tiny facet of equivalence.
Having identical binaries is the ideal case, but not the most probable one.
Full-on formal verification that two programs are equivalent basically falls under the domain of Satisfiability Modulo Theories, which are NP-hard.
Comparing just two different but equivalent boolean equations is the SAT problem as mentioned earlier, which is NP-complete, and even a boolean equation as small as 64 variables can take a consumer computer a considerable amount of time to verify, never mind an entire program.
There was a research paper that just came out on Ghidra's P-code and how it doesn't completely cover the semantics of instructions, which is currently impeding Ghidra's ability to do tasks as simple as symbolically interpreting the original binary.
Keep in mind, this isn't even getting into decompilation, this is still at the level of interpreting individual instructions in the original binary, and even that is fraught with problems of maintaining semantic equivalence.
>Full-on formal verification that two programs are equivalent basically falls under the domain of Satisfiability Modulo Theories, which are NP-hard.
So what? You don't even need to do that shit at all if you restrict yourself to where the output is identical. We wouldn't be looking to find the infinite set of all equivalent programs. At least I wouldn't. You can boil the ocean for the rest of eternity if you want to.
>restrict yourself to where the output is identical.
And how do you propose to do that?
Multiple anons have chimed in at this point to say how difficult it is to get an identical binary out.
They're wrong. Want a binary with 0x01, 0x02, 0x03 in it?
const fs = require('fs');
// first byte, then the second and third
fs.writeFileSync('out.bin', Buffer.from([0x01, 0x02, 0x03]));
The comments can be whatever meaning gets ascribed to them, from a human or an AI.
The code can be in some other programming language, substitutable at your leisure, and the bytes don't need to be listed in order.
Congrats, you made a trivial example. One example isn't a proof for the general case of any arbitrary binary.
But also, that code actually probably won't compile to a byte-identical binary if you change compilers or flags, so it's not even a good choice of example for you.
Going from machine code to assembly is like translating from binary to ASCII.
It's semantically the exact same information.
On the other hand, translating from English to French is very difficult without subtly changing the meaning.
For instance, here's my post translated by Google translate from English to French and back again.
>Congratulations, you have made a trivial example. An example is not a proof for the general case of an arbitrary binary. But also, this code probably won't be compiled into a byte-identical binary if you change compilers or flags, so it's not even a good example choice for you.
>Moving from machine code to assembly is like translating binary into ASCII. It is semantically exactly the same information. On the other hand, translating from English to French is very difficult without subtly changing its meaning. For example, here is my article translated by google translate from English to French and vice versa.
>You will notice that the text is not the same
You'll notice the text is not the same.
meanwhile, in the real world outside of the ivory tower:
Made by gamers of all people, with human readable labels and variable names to boot. Not even an AI used here. Maybe get chatGPT to explain it to you in your language of choice.
That's a disassembly, not a decompilation.
Disassembly is orders of magnitude simpler than what we're discussing.
so you don't consider it possible to translate from English to French with a computer? The millions of people using Google Translate every day don't care that you think so.
How about C64 BASIC converted into C then?
Ghidra P-code. Jeez...
Doesn't emulate a 6502 and therefore doesn't support the USR command.
You probably wouldn't be able to run most C64 BASIC programs that use PEEK and POKE to directly manipulate the C64 either.
If you were using it as a scripting language, you wouldn't give a shit. Only pedantic twats would.
Anon, this whole fucking thread has been one big debate about formal equivalence.
Think before you post.
Also, I searched through the whole code base and found lots of commands appear to be missing. No ABS, LOG, SGN, DIM, and probably more I haven't checked.
It's not the same any more. The authors stripped stuff out and added other stuff to repurpose it. It's neat though huh?
It's definitely neat! But a bad choice of example for this debate.
(Likewise, I'm confused why the author claims "100% compatibility", when it's clearly not. I admit USR was a nitpick, but the lack of half the math functions is weird.)
Well, I'm coming from a security research point of view, since that's one of the few real use cases for decompilation.
And in that case, you need it to be exactingly accurate.
>the lack of half the math functions is weird
That is weird. Maybe they aren't needed for messing about with text files? I linked to this since it isn't merely a reimplementation but a machine conversion of the original ROM's code to a high-level language (which has since been modified), which some here seem to have claimed is impossible.
Actually, thinking about this some more, the math functions weren't part of the C64 BASIC ROM. They were in the C64 KERNAL ROM. Long story.
>which some here seem to have claimed is impossible.
No one claimed that.
The arguments thus far have been that:
1. AI is ill-suited for this task
2. Algorithmic decompilation is unlikely to create a byte-identical binary once recompiled
3. Even algorithmic decompilation needs to be very careful to ensure it doesn't change the semantics of the program.
mist64/cbmbasic fails to disprove any of these arguments since it:
1. Is not decompiled by an AI
2. Does not produce a byte-identical binary
3. Is not semantically equivalent to CBM Basic on a real C64.
yes, of course. But I'm saying that only opinions have been offered around the first point, one way or the other. And the third point may not matter depending on your needs. No two things can ever be totally equivalent to arbitrary precision anyway, similar to the Ship of Theseus. Everything can only ever be a sufficient approximation at some level.
Some want it to be about proper formal equivalence. I suppose it depends on how formal you need that equivalence to be.
>So there's many potential source codes that produce the same output.
Only because there are multiple compilers each with multiple compilation options.
To reproduce the binary you need the same compiler, with the same flags, with the same source.
It's an unbounded n^3 search space at the very minimum.
for example, the ADD instruction always adds, so you have more context available. And you don't need to use the same compiler or even the same language as the original used. Only the output needs to be the same. If it wasn't the same output the program might still behave the same, but that is indeed a hard thing to determine.
>And you don't need to use the same compiler or even the same language as the original used
And theoretically my dryer could fold all my clothes.
The chances of getting any language other than assembly to match up with a binary generated by a different language are near zero.
Even assembly wouldn't guarantee it, since lots of instructions have multiple possible machine code equivalents (a fact that is sometimes used to watermark binaries).
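One concrete case from the standard x86 encoding tables: `mov eax, ebx` has two byte-level encodings, because x86 defines both a "mov r/m32, r32" (opcode 0x89) and a "mov r32, r/m32" (opcode 0x8b) form:

```javascript
// Two distinct x86 encodings of the same instruction, mov eax, ebx:
const encodingA = [0x89, 0xd8]; // opcode 0x89: mov r/m32, r32 direction
const encodingB = [0x8b, 0xc3]; // opcode 0x8b: mov r32, r/m32 direction

// Semantically identical; byte-for-byte different. An assembler may emit
// either, and a watermarker can hide one bit per such instruction in the choice.
const sameBytes = encodingA.every((b, i) => b === encodingB[i]);
```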
>But it is always possible with due care
Yes, but you're talking about extreme attention to detail, often using compiler directives to force the output to be the same, like forcing variables to be allocated to a specific address.
>More broadly, do you think a set of bytes exists that a copy cannot be made of.
No, that seems to be the straw man you've made of my argument though.
My whole point is that going back and forth across a many-to-many mapping is statistically unlikely to produce a byte-identical equivalent, especially when you're throwing a retarded AI at the problem.
> Extremely unlikely
I think it's mathematically impossible. Take two different programs that approximate a known distribution with a particle filtering algorithm.
Let's say one occupies N bytes. If you add a sufficient number of extra random-sample evaluations that translate to a binary exceeding N, you have two semantically equivalent programs that cannot have the same
*have the same binary. You can grow N to infinity, and thus you have a finite number of byte-equivalent, semantically equivalent programs and an infinite number of byte-different, semantically equivalent programs.
The byte different, semantically equivalent programs would certainly be very hard to identify in general.
I'd agree it's practically impossible. That's the reason for wanting to restrict to byte-equivalent output.
> It might take a long time
It's not a matter of time, it's just impossible
I divided a certain number X by 5001 and the remainder is 602.
What was the X I used for the aforementioned operation? The process was purely deterministic, and consisted of a finite number of data pieces and operations.
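The point being that the remainder is a many-to-one map: every X of the form 602 + 5001·k produces it, so the original input is unrecoverable from the output alone. Quick check:

```javascript
// Infinitely many X collapse to the same remainder: 602 + 5001*k for any k.
const candidates = [0, 1, 2, 1000].map(k => 602 + 5001 * k);
const allMatch = candidates.every(x => x % 5001 === 602);
// allMatch is true, yet nothing picks out which X was "the" input.
```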
What? Yes we can. People have done it. You read the disassembly, internalize the theory, and reimplement it.
I don't doubt AI assist might become a thing, but as impressive as current projects like ChatGPT are, they're miles from being able to produce decent decompilation results.
Decompiling software requires very fine attention to detail to accurately reproduce the complex interactions between instructions.
Current AI approaches would confidently give you something that looks correct, but is actually all bullshit, or mirrors the general idea of the code but fails to capture all the precise details that are so important to a successful decompilation.
It's not like chess where the problem space is limited to a 8x8 board with less than 32 pieces on it.
Decompiling is completely different from creating code from a description.
Because decompiling can be easily tested by just compiling the code to check if the outcome is right or wrong.
There are infinitely many ways for the code to be wrong, and only a small number of ways that are correct.
Also, formal verification that the binary produced from the decompiled code is equivalent to the original binary is no small task either.
The decompiler could be made in the traditional way.
Then the AI would just organize stuff and give everything human-recognizable names.
The next version of IDA pro will probably have the feature.
even if it didn't produce source code, just naming functions and variables would make life a lot easier; guessing what variables are supposed to represent is often a lot harder than following along with assembly code
I think OP had a great idea and you are the typical basement chud that never achieves anything and always complains about others, aka a loser
If OP makes a tool like that, I'm sure people will use it and he'll land a great job
The usefulness of it for programmers would surge through the roof - but it'd be a legal nightmare. There's a conflict between the primary and wealthiest demographic for such a product and who can actually make it.
If this is true it will utterly destroy freetards and freetard software. Imagine if they try to shill their "open source" garbage and you just tell them
>uhh but everything is open source if you just open it in (A)IDA!
How would it be a bad thing for free software? It's one thing to recommend against pirated windows, it's another one to use a version without the spyware
well that's not actually what open source means so i doubt they would care
it all adds up
If you have a model that has a ton of source > compile > machine code in its training then that would be a start...
But considering how the models work getting to what you're thinking of is going to take some effort since they love to just spit out bullshit by default that looks good.
Actually I have an idea for a way to test this out right now. A unique situation that isn't useful at all to what you're saying with VSCode and shit, but something it should have in its training that's similar to my first paragraph... it won't be useful in this form but I'll go see how viable it might be with chatGPT in its current shitty state.
I'll report back. It's not going to be interesting though and there will be nothing specific to show. But I'll see if it's possible to go from bytecode back to a function(actually just saying that I can picture how it won't even be able to get the function name right but whatever...).
Yeah this somewhat unsurprisingly didn't work, but I kind of shot for the stars with what would have been really impressive if it had worked.
This form of free chatGPT is dogshit as everyone knows and I don't have API access at the moment anyway to really try to stick enough forks into it to get it doing something super basic with bytecode to work up from there...
This is one of those things that seems like a fool's errand though, tbh, if you're seriously gunning for a productive result... but for a challenge, just to get some really basic shit going reliably... maybe
>thinking AI-worshipping gays know the first thing about code theory
My mistake. I'll leave you to your NFT mine.
There's a neat idea, and it's checkable, in that you can compare the bytecode the compiler generates from the AI-generated source code against the original until it gets it right.
I'm sure that someone has already done this.
If not, several teams across the world are working on it now.
Start saving binaries from things you want to recompile in the future because there will obviously be obfuscation added in the future to try to stop freedom
If the code needs to still be executable, which it would, any obfuscation would be useless.
See Denuvo being hacked recently in the news for a case in point. That didn't even need an AI to be done.
>If the code needs to still be executable, which it would, any obfuscation would be useless
elaborate? are you saying obfuscated code can't execute?
No, if it can execute then a suitable AI wouldn't be confused by any obfuscation.
well idk, i imagine what OP is talking about is you would compile a program then train the AI on the binary
eventually it learns how the specific compiler generates binaries. But it's not gonna learn how to unfuck obfuscation that's applied after compilation.
I mean, most obfuscation is a lossy process: you literally lose information about the original control flow, and an AI would never be able to undo that. So obfuscation would still be effective.
You lose information when you scale an image to a smaller size but AI can still upscale it decently.
Have you ever reversed a binary?
It doesn't take much to make decompilation a real headache.
Self-modifying code, non-standard calling conventions, call indirection, on-the-fly decryption, and embedded custom virtual machines can all dramatically increase the amount of work that needs to be done, while also increasing the number of opportunities for introducing subtle errors.
Sure, dynamic analysis helps, but that's not always possible, or desirable.
Yeah, I have. It still requires manual creative leaps at present, but nothing in principle that couldn't be brute forced by an AI.
that would also be a goal for the AI to solve
if AI can translate languages, then it can turn assembly into source code. it's that simple.
we already have static decompilers that can guess functions based on machine-code patterns. Sure, you could get an AI to do this too. The trouble is actually connecting everything so that it works. A single wrong statement makes the whole source code not work.
It's like how you can ask an AI right now to generate every single individual snippet of code that happens in a whole program, and it can do that no problem. But ask it to take all those snippets and actually connect them into a program? It can't do that, because that would require the AI to actually understand what code is, which it doesn't, at least language models don't.
You will never have a perfect decompiler since compilation is inherently a lossy process.
If your CPU is broken, the program still won't behave the same as it hypothetically otherwise would. But that's metaphysics territory, and nonsense besides.
>Why has nobody made a decompiler that takes machine code and transforms it into a human-readable Visual Studio source code project?
Because your machine code is just a binary code.
And everyone who knows the binary system can convert it into numbers -> signs.
The garden gnomes did this thousands of years ago in their Bible (Talmud Gematria).
>The garden gnomes did this thousands of years ago in their Bible (Talmud Gematria).
I lolled at this comment. Thanks for the change of pace.
I'm told pop music today just copies what the last generation had as well 🙂
Okay, okay, AI-bro, let's assume that you have an NN that does the decompilation, and you get the result. Then you put that result into a compiler to get the executable code again, which you compare to the original. BUT, what will you tell the user if they're not equal byte for byte? "This program cannot be decompiled"?
Hmm. That's a tough one. Perhaps better UX and expectation management, so the user isn't expecting miracles in the first place? The tool could present its result as a starting point for their workflow, for example; that's how existing decompilers sell themselves. At least they'd be getting better labels than sub_54EB:
Okay, so how can you guarantee that your NN will not spew bullshit?
Honestly, even when training an NN on IDA data, you can't assure me it won't spew something different from IDA's output
You wouldn't be training it on raw IDA Pro output. You'd be training it on the end result, with proper label names after human intervention. Or better yet, on decompiled code that has the DWARF debugging symbols with it. You'd also hook it into the decompiler so it would only be used to provide symbol names.
MUH COPYRIGHTS OY VEY
Security researchers need to know this stuff. This AI would be additional assistance to them, to make them faster.
The decompiled source could be automatically annotated with plain-English comments and uploaded to M$ GitHub, so that another AI called Copilot will obfuscate its origins and make it legal
Traditional computer software tools resemble the standard mathematical concept of a function f: X → Y: given an input x in the domain X, it reliably returns a single output f(x) in the range Y that depends on x in a deterministic fashion, but is undefined or gives nonsense if fed an input outside of the domain. For instance, the LaTeX compiler in my editor will take my LaTeX code, and - provided that it is correctly formatted, all relevant packages and updates have been installed, etc. - return a perfect PDF version of that LaTeX every time, with no unpredictable variation. On the other hand, if one tries to compile some LaTeX with a misplaced brace or other formatting problem, then the output can range from compilation errors to a horribly mangled PDF, but such results are often obvious to detect (though not always to fix).
AI tools, on the other hand, resemble a probability kernel μ: X → Pr(Y) instead of a classical function: an input x now gives a random output sampled from a probability distribution μₓ that is somewhat concentrated around the perfect result f(x), but with some stochastic deviation and inaccuracy. In many cases the inaccuracy is subtle; the random output superficially resembles f(x) until inspected more closely. On the other hand, such tools can handle noisy or badly formatted inputs x much more gracefully than a traditional software tool.
Because of this, it seems to me that the way AI tools would be incorporated into one's workflow would be quite different from what one is accustomed to with traditional tools. An AI LaTeX to PDF compiler, for instance, would be useful, but not in a "click once and forget" fashion; it would have to be used more interactively.
There are research projects that do this. Look at recent years of USENIX security, NDSS, etc.