Binary Formats Gallery (formats.kaitai.io), submitted by vitalnodo 2 days ago
  • dtagames 2 days ago

    Interesting. I didn't know anyone had come up with a declarative language for binary files.

    • jchw 2 days ago

      There's actually more than one, though Kaitai probably has the most maturity of any of them.

      Various hex editors have their own formats. 010 Editor has C-style binary templates, imhex has a binary pattern language as well. Okteta has Okteta Structure Definitions which can be declared using XML or with JS.

      Kaitai Struct is the most complete system that has code generation for multiple programming languages and isn't tied to a hex editor or anything else for that matter. That said, I think there's still a ton of room for improvement and innovation. Kaitai has a lot of useful tooling, but I think as it is today it falls a bit short: the code gen is not at the same support level for all languages (most languages are fairly limited), and I think serialization is still mostly experimental. That and there's probably a lot you could do to still make it more expressive and powerful.

      • weinzierl 2 days ago

        An adjacent or complementary field is description of data in transit. Wireshark dissectors come to mind. I think it'd be quite useful to unify these fields.

        • jchw 2 days ago

          I had been trying to make a Kaitai-to-Wireshark dissector compiler in my third-party Kaitai implementation[1]. However, the Wireshark emitter is still basically useless for now. It only supports basic structs with basic attrs.

          I mainly started a third-party Kaitai implementation to experiment a bit with supporting new features in Go, and also just to have a native Go implementation for convenience, since I'm still not very good at Scala. However, once an approach is developed for how exactly to handle emitting to Wireshark it should be purely mechanical to graft on a Wireshark emitter to the upstream Kaitai Struct compiler, too.

          [1] https://github.com/jchv/zanbato

        • krapht 2 days ago

          Have you used Google wuffs?

          • jchw 2 days ago

            No, though I am familiar with it. I wouldn't have classified Kaitai and wuffs as being the same category of software, though I can see why you would.

        • rented_mule 2 days ago

          In addition to languages, there's a Python library called "construct" that's been around for a long time. It uses a declarative style to make it surprisingly easy to make binary parsers and emitters.

          https://construct.readthedocs.io/en/latest/intro.html#exampl...
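To make "declarative style" concrete, here is a minimal sketch using only Python's standard `struct` module. It is not construct's API: the `HEADER_FIELDS` table, the field names, and the `DEMO` magic value are all hypothetical, made up for illustration. The point is that one field declaration drives both the parser and the emitter.

```python
import struct

# Hypothetical record layout, described as data rather than as parsing code.
# Each entry is (field name, struct format code).
HEADER_FIELDS = [
    ("magic", "4s"),    # 4 raw bytes
    ("version", "H"),   # unsigned 16-bit integer
    ("width", "I"),     # unsigned 32-bit integer
    ("height", "I"),    # unsigned 32-bit integer
]

# "<" selects little-endian with no padding between fields.
HEADER = struct.Struct("<" + "".join(code for _, code in HEADER_FIELDS))


def parse_header(data: bytes) -> dict:
    """Decode the leading bytes of `data` into a dict keyed by field name."""
    values = HEADER.unpack_from(data)
    return dict(zip((name for name, _ in HEADER_FIELDS), values))


def build_header(fields: dict) -> bytes:
    """Encode a dict back into bytes using the same declaration."""
    return HEADER.pack(*(fields[name] for name, _ in HEADER_FIELDS))


raw = build_header({"magic": b"DEMO", "version": 2, "width": 640, "height": 480})
header = parse_header(raw)
```

Because parsing and building share the declaration, `build_header(parse_header(raw))` reproduces `raw` for this fixed layout. Real formats need conditionals, repetition, and offsets on top of this, which is where the dedicated tools come in.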

          • emddudley 2 days ago

            There's an old XML one called Data Format Description Language (DFDL).

            • 0points 2 days ago

              There's a metric ton of them by now. Here are some incomplete notes from a couple of years ago:

              ### kaitai - https://github.com/kaitai-io/kaitai_struct - https://github.com/kaitai-io/awesome-kaitai - http://formats.kaitai.io/dos_datetime/index.html

              ### Hexinator / Synalyze It! - Universal Parsing Engine - Hexinator is freemium version of Synalyze It! - https://github.com/synalysis/Grammars/blob/master/bitmap.gra...

              ### quickbms - http://aluigi.altervista.org/quickbms.htm

              ### multiex - http://multiex.xentax.com/

              ### Game Extractor by WATTO - http://www.watto.org/game_extractor.html

              ### 010 editor templates - https://www.sweetscape.com/010editor/repository/templates/

              ### hex fiend templates - https://github.com/HexFiend/HexFiend/tree/master/templates

              ### malcat - has some form of binary templates - https://malcat.fr/

              ### Andys Binary Folding Editor - http://www.nyangau.org/be/be.htm

              ### winhex templates - https://www.x-ways.net/winhex/templates/index.html

              ### TrID - file identifier - TrID is a utility designed to identify file types from their binary signatures. - https://mark0.net/soft-trid-e.html

              ### GNU file - https://github.com/file/file

              ### Noesis - Noesis is a tool for previewing and converting between hundreds of model, image, and animation formats. - http://richwhitehouse.com/index.php?content=inc_projects.php... - https://github.com/RoadTrain/noesis-plugins - https://github.com/RoadTrain/noesis-plugins-official

              ### Ninja ripper - extract individual models from DirectX 3D games, while they are running - https://ninjaripper.com/

              ### Unpakke - http://www.nullsecurity.org/unpakke

              ### Camoto online-only universal game modding tool - https://moddingwiki.shikadi.net/wiki/Camoto - https://camoto.shikadi.net/

            • viraptor 2 days ago

              Imhex https://imhex.werwolv.net/ has another one. Not fully declarative, but that makes some things easier to deal with.

              • cr125rider 2 days ago

                How does this compare to how protobuf defines structures?

                • mananaysiempre 2 days ago

                  Completely different problem, completely different solution.

                  Protobuf and its ilk (ASN.1, Cap’n Proto, etc.) have you describe a tree structure, then map that to bytes according to their own sensibilities. Kaitai and its ilk (Wireshark might be a more familiar member of the group) have you describe a bunch of data structures as well as somebody else’s pretty much arbitrary ideas as to how they are to map to bytes, then deal with the results.

                  You can’t use a Protobuf implementation to get EXIF data out of JPEGs, but then you can’t get format evolution guarantees out of Kaitai either.

                  (I hear ASN.1 can somewhat cross the gap using ECN, but as far as I can tell literally nobody uses that in public.)

                • knome 2 days ago

                  You should check out Erlang's binary literals.

                • redsparrow 2 days ago

                  I had a great experience using Kaitai in a previous job. We were decoding proprietary binary messages from Teltonika OBD GPS trackers. The online editor, https://ide.kaitai.io/, is really nice for developing and testing your definition. You can store multiple binary files in local storage and you get a nice detailed look at the data and how your definition is parsing it.

                  • foobarbecue 2 days ago

                    Kaitai was awesome for reverse-engineering the Soloshot session format https://github.com/foobarbecue/soloshot-session-to-gpx-conve...

                    • hombre_fatal 2 days ago

                      Kaitai is cool but it seems like kind of a waste since you can't roundtrip the data back into binary.

                      • pmarreck 2 days ago

                        Is this able to represent any binary format? How do things like relative offsets work and such? (basically any non-rigid parts of the format)

                        • rpearl 2 days ago
                          • frizlab 2 days ago

                            It can represent a UTF-8 string, so it can probably represent anything.

                            • jcranmer 2 days ago

                              As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:

                              * Things may be non-byte-aligned bitstreams.

                              * Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."

                              * Fields that may be optional if some parent of the current record has some weird value.

                              * Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of them.

                              * The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)

                              and so on.

                              File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncrasies.
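Two of the complications listed above can be sketched in hand-written Python, to show what a format description language has to be able to express. The ELF part follows the standard header layout (a 16-byte `e_ident` with the class byte at index 4 and the data-encoding byte at index 5, followed by 16-bit `e_type` and `e_machine`). The sentinel-terminated record layout in `read_records` is purely hypothetical: a one-byte id, followed by a 16-bit value unless the id is 5.

```python
import struct


def parse_elf_ident(data: bytes) -> dict:
    """Read the start of an ELF header; byte order depends on an earlier byte."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    bits = {1: 32, 2: 64}[ei_class]       # ELFCLASS32 / ELFCLASS64
    endian = {1: "<", 2: ">"}[ei_data]    # ELFDATA2LSB / ELFDATA2MSB
    # e_type and e_machine sit right after the 16-byte e_ident block and
    # must be decoded with the byte order that e_ident just announced.
    e_type, e_machine = struct.unpack_from(endian + "HH", data, 16)
    return {"bits": bits, "endian": endian, "type": e_type, "machine": e_machine}


def read_records(data: bytes, offset: int = 0):
    """Hypothetical layout: u8 id, then a u16 value unless id == 5 (terminator)."""
    records = []
    while True:
        (rec_id,) = struct.unpack_from("<B", data, offset)
        offset += 1
        if rec_id == 5:
            # The terminator carries no payload, so nothing else is read.
            break
        (value,) = struct.unpack_from("<H", data, offset)
        offset += 2
        records.append((rec_id, value))
    return records, offset
```

In both functions the shape of what comes next depends on a value read earlier, which is exactly what a purely static struct description cannot capture.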

                            • mananaysiempre 2 days ago

                              UTF-8 is a regular language (as a subset of all octet strings), so that doesn’t feel like much of a benchmark? Something like TIFF or PE/COFF would seem to be a more reasonable standard. (PDF is probably too much to ask, seeing as understanding the structure requires a full Deflate decoder among other things.)
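The "regular language" point can be illustrated directly: well-formed UTF-8 can be recognized by a plain regular expression over bytes, with no counters or lookups. The sketch below transcribes the well-formed byte ranges from the Unicode Standard into Python's `re` syntax; the names `WELL_FORMED_UTF8` and `is_utf8` are just illustrative.

```python
import re

# One alternative per row of the well-formed UTF-8 byte sequence table.
WELL_FORMED_UTF8 = re.compile(
    rb"""(?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences (no overlongs)
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )*""",
    re.VERBOSE,
)


def is_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte string."""
    return WELL_FORMED_UTF8.fullmatch(data) is not None
```

A format like TIFF cannot be handled this way, since reading it means following offsets stored in the data itself.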