Added a FS for any KS-generated parser. #13

Open
KOLANICH wants to merge 4 commits into master

Conversation

KOLANICH (Contributor) commented on Jul 4, 2018:

It's buggy and doesn't work (I don't see the binary blobs there for now), but I'm publishing it anyway.

KOLANICH changed the title from "Addedd a FS for any KS-generated parser." to "Added a FS for any KS-generated parser." on Jul 4, 2018

KOLANICH (Author) commented on Jul 16, 2019:

@GreyCat, @dgelessus, now it seems to be working.

dgelessus (Contributor) commented:

I've done a bit of testing - it seems to be working for the most part, but I got this exception repeatedly:

Uncaught exception from FUSE operation read, returning errno.EINVAL.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 734, in _wrapper
    return func(*args, **kwargs) or 0
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 846, in read
    offset, fh)
  File "/home/user/.local/lib/python3.7/site-packages/fuse.py", line 1251, in __call__
    ret = getattr(self, op)(path, *args)
  File "/home/user/work/Kaitai Struct/kaitai_fs/kaitaifs/fs/AnyKSFS.py", line 317, in read
    data = self.get_file_body(obj, offset, length)
  File "/home/user/work/Kaitai Struct/kaitai_fs/kaitaifs/fs/AnyKSFS.py", line 321, in get_file_body
    return obj[offset:offset + length]
TypeError: 'SignatureType' object is not subscriptable

I was testing with the archive/phar_without_stub.ksy spec. signature_type is an enum.
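
For illustration, the failure above comes from slicing a value that is not bytes-like; a minimal guard in get_file_body along these lines would avoid the TypeError (a rough sketch only, not the fix adopted in this PR; FuseOSError is fusepy's standard way to return an errno):

import errno
from fuse import FuseOSError

def get_file_body(self, obj, offset, length):
    # Only bytes-like values can be sliced; enum members such as
    # SignatureType end up here otherwise and raise TypeError.
    if isinstance(obj, (bytes, bytearray)):
        return obj[offset:offset + length]
    # Anything else would need to be serialized first (or rejected).
    raise FuseOSError(errno.EIO)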

Other than that it seems to be working well. Though I don't quite understand the logic behind how the fields are exposed as files - most fields are grouped together into a single __dict__.json file, but some fields become separate .bin files. Sometimes the __dict__.json actually contains a JSON array rather than an object (which is odd, because in Python __dict__ is always a dict and not a list).

KOLANICH (Author) commented on Jul 17, 2019:

Though I don't quite understand the logic behind how the fields are exposed as files - most fields are grouped together into a single __dict__.json

In fact it is not necessarily JSON; it can also be msgpack, BSON, or YAML, depending on the filter chain, though I have not yet implemented configuring it from the CLI. It is also possible to represent objects of other types as different files, depending on their types, by adding a filter. I have not yet decided what to do with strings (put them into the dict or into .txt files); probably we should act depending on their sizes.
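
To illustrate the filter-chain idea (all names below are made up for the sketch, they are not the PR's actual API), the output format could be a pluggable mapping from format name to a file extension and a dump-to-bytes callable:

import io
import json

SERIALIZERS = {
    "json": (".json", lambda obj: json.dumps(obj, indent=1).encode("utf-8")),
}

try:
    import msgpack  # optional dependency
    SERIALIZERS["msgpack"] = (".msgpack", msgpack.packb)
except ImportError:
    pass

try:
    import ruamel.yaml  # optional dependency

    def _dump_yaml(obj):
        buf = io.StringIO()
        ruamel.yaml.YAML(typ="safe").dump(obj, buf)
        return buf.getvalue().encode("utf-8")

    SERIALIZERS["yaml"] = (".yaml", _dump_yaml)
except ImportError:
    pass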

but some fields become separate .bin files.

If the property's type is a subclass of bytes or bytearray, it goes into a separate .bin file.
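
In other words, the splitting rule is roughly the following (an illustrative sketch of the behaviour described above, not the PR's actual code):

def split_fields(parsed_obj):
    # Bytes-like attributes become standalone <name>.bin files;
    # everything else is collected into the shared __dict__ entry.
    blobs, scalars = {}, {}
    for name, value in vars(parsed_obj).items():
        if name.startswith("_"):
            continue  # skip KaitaiStruct internals such as _io, _parent, _root
        if isinstance(value, (bytes, bytearray)):
            blobs[name + ".bin"] = value
        else:
            scalars[name] = value
    return blobs, scalars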

Sometimes the __dict__.json actually contains a JSON array rather than an object (which is odd, because in Python __dict__ is always a dict and not a list).

Yes, I have not yet implemented different naming for arrays.

I was testing with the archive/phar_without_stub.ksy spec. signature_type is an enum

Good catch, though I am unsure how to represent it. Probably we should switch from JSON to YAML and add the enum member name as an end-of-line comment.
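
For what it's worth, ruamel.yaml can attach such comments when dumping in round-trip mode; a minimal sketch of the idea (the key and value below are made up for illustration):

import sys
from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap

data = CommentedMap()
data["signature_type"] = 4  # hypothetical raw enum value
data.yaml_add_eol_comment("SignatureType.sha256", key="signature_type")

YAML().dump(data, sys.stdout)
# prints roughly: signature_type: 4  # SignatureType.sha256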

Also, I wanted to implement caching. Currently, to get the sizes of JSON objects, they are generated and then discarded. We should cache the sizes.

dgelessus (Contributor) commented:

probably we should act depending on their sizes

I don't think this would work well in practice - if you do that, every program that wants to read a string field needs to check both for a separate file and in the __dict__.json. Even for human users it would be confusing to have to check in multiple places.

Also, I wanted to implement caching. Currently, to get the sizes of JSON objects, they are generated and then discarded. We should cache the sizes.

It might be a good idea to cache the entire string/bytes representation of the JSON objects, that way reading them is also faster. Though to be honest I don't think performance is a big issue in practice - my "generic FS" implementation (#17) has no caching at all, and I never had any noticeable performance problems with it (even on a 2010 ThinkPad with HDD).
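
A rough sketch of that idea, with made-up names (serialize once, reuse the cached bytes for both getattr's st_size and read):

import json

class SerializedCache:
    def __init__(self, serialize=lambda obj: json.dumps(obj, indent=1).encode("utf-8")):
        self._serialize = serialize
        self._by_path = {}

    def get(self, path, obj):
        # Serialize on first access, then reuse the cached bytes.
        data = self._by_path.get(path)
        if data is None:
            data = self._by_path[path] = self._serialize(obj)
        return data

    def size(self, path, obj):
        return len(self.get(path, obj))

    def read(self, path, obj, offset, length):
        return self.get(path, obj)[offset:offset + length]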

dgelessus (Contributor) left a comment:

Some general comments:

  • The code doesn't follow PEP 8 - almost all names are camelCase where they should be snake_case, and spaces are missing in many places (especially in assignments and type annotations).
  • Some parts of the code are IMHO not very understandable currently, especially the TypeSplitter class hierarchy - it's not obvious which methods/fields are meant to be used externally and which ones are just implementation details. In general there should be more comments and docstrings.
  • Some of the TypeSplitter subclasses are meant as alternatives to each other and cannot be used together. It would be helpful to document this somewhere.
  • A few third-party libraries (e.g. ruamel.yaml, msgpack) are used even though they are not listed as dependencies in setup.cfg.


from fuse import FUSE

from functools import partial

from kaitaifs.parser.rar import Rar

dgelessus commented on Jul 17, 2019:

Is the Rar import used anywhere in __main__?

def getParserCtor(moduleName:str):
    import importlib
    import ast
    from pathlib import Path

dgelessus:

Any particular reason why the modules are imported inside the function? The usual Python convention is to import everything at top level, unless there's a specific reason not to (like lazy loading for performance reasons).

FILESYSTEMS = {
    'rar': RarFS,
    'quake_pak': QuakePakFS,
    'heroes_of_might_and_magic_agg': HeroesOfMightAndMagicAggFS,
    'iso9660': Iso9660FS,
}

def getParserCtor(moduleName:str):

dgelessus:

The function name should be get_parser_ctor, as per PEP 8. Also, get_parser_class might be a better name (from my experience, the term "ctor" is not used often in Python).

from pathlib import Path
try:
    mod=importlib.import_module("."+moduleName, package="kaitaifs.parser")
except:

dgelessus:

This should be except ImportError to avoid catching unrelated exceptions.

I'm also not sure if this try block is a good idea. The main use case for a "generic" KaitaiFS is for working with specs that KaitaiFS doesn't support natively, so there's usually no need to use it with the specs included with KaitaiFS.

Also, this could lead to confusing behavior when a spec exists both at top level and under kaitaifs.parser. If you run kaitaifs --any rar ..., the current code would always use kaitaifs.parser.rar, even if you have a rar module on your PYTHONPATH. If this try block is kept, the logic should be reversed so that it first searches for a top-level module, and only looks into kaitaifs.parser if no top-level module is found.
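
If the fallback is kept, the reversed lookup order could look roughly like this (a sketch; the function name is made up):

import importlib

def import_parser_module(module_name):
    # Prefer a module on PYTHONPATH; only fall back to the specs
    # bundled under kaitaifs.parser if no top-level module exists.
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return importlib.import_module("." + module_name, package="kaitaifs.parser")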

a=ast.parse(p.read_text())
for el in a.body:
    if isinstance(el, ast.ClassDef):
        return getattr(mod, el.name)

dgelessus:

Using ast here might be a little overkill. Also, do we know for sure that the top-level class will always be the first class in the generated file?

For context, my implementation originally required the user to pass both module name and class name explicitly. Later I changed it so that the class name is automatically extracted from the module name (by taking the last part of the module name and converting it to CamelCase).
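
The conversion I mean is roughly this (a sketch that assumes KSC's usual snake_case-module-name to CamelCase-class-name mapping; the function name is made up):

import importlib

def get_parser_class(module_name):
    # e.g. "phar_without_stub" -> class PharWithoutStub in that module
    mod = importlib.import_module(module_name)
    last = module_name.rpartition(".")[2]
    class_name = "".join(part.capitalize() for part in last.split("_"))
    return getattr(mod, class_name)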

class TypeSplitter:
    fileExtension = None
    def __init__(self):
        raise NotImplementedError()

dgelessus:

This is not a good way to implement abstract classes, because it prevents subclasses from calling super().__init__(...). It would be better to use Python's standard abc module here.
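
A sketch of what I mean (the abstract method name here is only illustrative):

import abc

class TypeSplitter(abc.ABC):
    fileExtension = None

    @abc.abstractmethod
    def split(self, obj):
        """Decide how a parsed object is mapped to files."""

# Subclasses can still call super().__init__(), must implement split(),
# and instantiating TypeSplitter itself raises TypeError.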

objectIndexOverhead = 2 # pair of quotes

def __init__(self):
    import ruamel.yaml

dgelessus:

Are there any practical differences between this module and PyYAML? I'm asking because in my experience PyYAML is more widely used (I haven't heard of ruamel.yaml before).

return res

#def getxattr(self, path, name, position=0):
# return {}

dgelessus:

This should be removed instead of commented out.

}

if isinstance(obj, (bytes, bytearray)):
    res['st_size'] = len(obj)

dgelessus:

What happens for other object types? I remember getting odd behavior with some programs when files don't have a correct size.
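
To make the concern concrete (an illustrative sketch, not something this PR already does): st_size should match whatever read() will actually return, e.g. the length of the serialized representation for non-bytes objects.

import json

def file_size(obj):
    # If st_size disagrees with what read() returns, some programs
    # truncate the content or keep reading past the end.
    if isinstance(obj, (bytes, bytearray)):
        return len(obj)
    # Assumes obj has already been converted to a JSON-serializable form.
    return len(json.dumps(obj, indent=1).encode("utf-8"))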

-'st_gid': 0,
-'st_uid': 0,
+'st_gid': os.getgid() if not isWin else 0,
+'st_uid': os.geteuid() if not isWin else 0,

dgelessus:

Is this condition necessary? I know that there's some kind of Windows port of FUSE, but I would expect it to handle file UIDs/GIDs properly (or at least ignore them if they don't apply on Windows).

If this condition is really needed, it would be good to add a comment explaining why.

KOLANICH (Author) commented:

if you do that, every program that wants to read a string field needs to check both for a separate file and in the __dict__.json

IMHO it is not for programs, but for humans manually examining files, running hex editors on various unknown chunks, or diffing with generic tools. Finished specialized programs should use the code generated by KSC directly.

Even for human users it would be confusing to have to check in multiple places.

Yes, it is just not comfortable to view large or multiline strings in JSON. We should definitely use YAML by default.
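
To illustrate the readability difference for multiline strings (a sketch using ruamel.yaml's literal block scalars; the sample text is made up):

import json
import sys
from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import LiteralScalarString

text = "line one\nline two\nline three\n"

print(json.dumps({"comment": text}))
# {"comment": "line one\nline two\nline three\n"}  <- escaped, one long line

YAML().dump({"comment": LiteralScalarString(text)}, sys.stdout)
# comment: |
#   line one
#   line two
#   line three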

KOLANICH (Author) commented:

Another thing we may want to have is symlinks when multiple instances point to the same object.
