Refactor many things, fix extra/missing commas #58

Open
wants to merge 102 commits into
base: develop
102 commits
eb7ebed
ref: improve usesAscii28 parsing from version
NickCrews Apr 6, 2023
c218b74
ref: extract ctxWarn() helper
NickCrews Apr 6, 2023
8e0e914
ref: add freeSafe() helper
NickCrews Apr 6, 2023
413fbcc
ref: add _compileRegex() helper
NickCrews Apr 6, 2023
1df3bd3
ref: statically compile regexes
NickCrews Apr 6, 2023
cb308c2
ref: split parseHeader() into *Legacy() and *NonLegacy() versions
NickCrews Apr 6, 2023
a79be13
ref: move string stuff to string_utils.h
NickCrews Apr 7, 2023
22fcc2c
ref: remove unused FEC_CONTEXT.useAscii28
NickCrews Apr 7, 2023
1467a7d
ref: Move isParseDone() to csv.h and csv.c
NickCrews Apr 7, 2023
4ef050a
ref: move initParseContext() into csv.h and csv.c
NickCrews Apr 7, 2023
088bf0b
ref: Move readField() into csv.h
NickCrews Apr 7, 2023
98774db
ref: store FIELD_INFO in PARSE_CONTEXT as value, not pointer
NickCrews Apr 7, 2023
88c31c7
ref: clean up a few warnings
NickCrews Apr 7, 2023
cd86048
ref: Move mapping regexes to separate, static struct
NickCrews Apr 7, 2023
efe1ff3
ref: deduplicate testIncludes in build.zig
NickCrews Apr 7, 2023
2f0d02f
ref: format test files with black
NickCrews Apr 7, 2023
41672ce
ref: Replace py.path with pathlib.Path in tests
NickCrews Apr 7, 2023
30b27a7
ref: test: simplify getting fixtures
NickCrews Apr 7, 2023
c9ef40c
ref: blacken client.py
NickCrews Apr 7, 2023
706710b
fix: Make reader thread a daemon
NickCrews Apr 7, 2023
c784f91
test: ref: Overhaul python testing
NickCrews Apr 7, 2023
7bc8549
ref: take char*, not PARSE_CONTEXT, as arg to lookupMappings()
NickCrews Apr 8, 2023
7427a07
ref: move create lookupType() helper in mappings.c
NickCrews Apr 8, 2023
18451c1
test: add a unique marker for each test case
NickCrews Apr 8, 2023
c17412e
ref: Move all form schema lookup logic into mappings.c
NickCrews Apr 8, 2023
46db0b4
ref: remove unused collectLineInfo() from encoding.h
NickCrews Apr 8, 2023
05dccfa
ref: rename lookupSchema() to formSchemaLookup()
NickCrews Apr 8, 2023
6772cdc
ref: rename lookupMappings() to updateCurrentFormSchema()
NickCrews Apr 8, 2023
93456d3
fix: Print correct string on extra column
NickCrews Apr 8, 2023
2e84117
ref: factor out ctxWriteField() when printing lines
NickCrews Apr 8, 2023
3dd36c4
ref: Rename writeQuotedCsvField() to writeQuotedString()
NickCrews Apr 8, 2023
86730f5
tests: add xfailing traling_commas test
NickCrews Apr 8, 2023
dbb7ee8
ref: rename PARSE_CONTEXT to CSV_LINE_PARSER
NickCrews Apr 8, 2023
4e5cdd1
ref: Simplify API of csv.writeField()
NickCrews Apr 8, 2023
b55d079
ref: tidy up csv.h
NickCrews Apr 10, 2023
29a73bf
ref: simplify readCsvSubField()
NickCrews Apr 10, 2023
66ad20d
ref: csv: make advanceField() a noop if isParseDone()
NickCrews Apr 10, 2023
39e5174
ref: make setVersion() take explicit str
NickCrews Apr 10, 2023
b30a4de
test: rename test case to legacy_header
NickCrews Apr 10, 2023
0d4d3c6
ref: fix a few simple warnings in mappings.c
NickCrews Apr 10, 2023
1c8173a
fix: Fix segfault crash when parsing legacy header
NickCrews Apr 10, 2023
3f75fd1
fix: test: fix legacy_header test case
NickCrews Apr 10, 2023
361fac5
ref: simplify comments in parseFec()
NickCrews Apr 10, 2023
0c2d4e3
ref: mark few private functions static
NickCrews Apr 10, 2023
d6df47f
ref: simplify parseLine() type warnings
NickCrews Apr 10, 2023
9e53f1b
Remove extra FEC constant in fec.c
NickCrews Apr 10, 2023
e5a70d6
feat: python: prioritize zig-out dir when searching for dll
NickCrews Apr 10, 2023
1f9aa57
test: ignore .DS_Store files from test cases.
NickCrews Apr 10, 2023
d6d3f5f
ref: const-ify many args
NickCrews Apr 10, 2023
aea66e9
ref: fix warnings on comparison of diff types
NickCrews Apr 10, 2023
e2abdc9
ref: explicitly pass NULL for types for header
NickCrews Apr 10, 2023
d0d8188
ref: fix: make formSchemaFree() safe from NULLS
NickCrews Apr 10, 2023
c531383
ref: encapsulate FEC_CONTEXT into FORM_SCHEMA
NickCrews Apr 11, 2023
e632121
test: add csv_test case
NickCrews Apr 11, 2023
235aaae
ref: remove unused asciiOnly flag from LINE_INFO
NickCrews Apr 11, 2023
3628de1
ref: improve growStringTo()
NickCrews Apr 11, 2023
f97e8b1
fix: don't reallocate extra byte in copyString()
NickCrews Apr 11, 2023
cfcebb9
ref: don't excessively pass around FEC_CONTEXT to _lineContainsF99Sta…
NickCrews Apr 11, 2023
472104d
ref: remove unused FEC_CONTEXT.currentLineLength, LINE_INFO.length
NickCrews Apr 11, 2023
be55068
fix: fix warning
NickCrews Apr 11, 2023
567e85d
ref: Move date and float csv writing to csv.c
NickCrews Apr 11, 2023
8c25c50
ref: Create CSV_FIELD as first class citizen in csv API
NickCrews Apr 11, 2023
3699c97
ref: tweak csv docstrings
NickCrews Apr 11, 2023
f7addbb
ref: csv: Remove advanceField(), replace columnIndex with numFieldsRead
NickCrews Apr 11, 2023
95d2a5d
ref: reduce level of nesting in parseLine()
NickCrews Apr 11, 2023
55c107b
ref: remove unused FEC_CONTEXT.summary
NickCrews Apr 11, 2023
1258352
ref: remove unused FEC_CONTEXT.f99Text
NickCrews Apr 11, 2023
e50e4b2
docs: improve memory.h docstring
NickCrews Apr 11, 2023
466cc9e
ref: reduce API of collectLineInfo()
NickCrews Apr 11, 2023
4f7a575
ref: simplify API of encoding.decodeLine()
NickCrews Apr 11, 2023
007c1c8
ref: Make freeString() OK with accepting NULLs
NickCrews Apr 11, 2023
0e846f9
ref: remove unused versionUsesAscii28() from fec.h
NickCrews Apr 11, 2023
6401f2b
ref: fix #includes
NickCrews Apr 11, 2023
bbe1e7a
ref: combine FEC_CONTEXT.version and versionLength into STRING
NickCrews Apr 11, 2023
8340587
ref: cleanup malloc()s, don't cast result
NickCrews Apr 11, 2023
763fbf9
ref: re-order stuff in FEC_CONTEXT
NickCrews Apr 11, 2023
1d7af46
ref: make path length calculation make more sense
NickCrews Apr 11, 2023
df17f90
ref: move path stuff from writer.c to new path.c
NickCrews Apr 11, 2023
6461d00
ref: add pathJoin() to path.h
NickCrews Apr 11, 2023
4f26872
ref: calculate outputDir outside of WRITE_CONTEXT
NickCrews Apr 11, 2023
12726ca
ref: name args when calling newFecContext()
NickCrews Apr 11, 2023
70776ed
ref: one arg per line for newFecContext()
NickCrews Apr 11, 2023
f7837de
ref: label args to newFecContext() in client.py
NickCrews Apr 11, 2023
91ade4a
ref: improve WriteCache in utils.py
NickCrews Apr 11, 2023
a524dd7
ref: simplify decoding bytes in provide_line_callback()
NickCrews Apr 11, 2023
2a790ee
ref: simplify provide_line_callback()
NickCrews Apr 11, 2023
1b8e100
ref: remove includeFilingId from FEC_CONTEXT
NickCrews Apr 12, 2023
cc2df27
BREAKING ref: re-order args to newFecContext()
NickCrews Apr 12, 2023
4919d82
ref: rename test case trailing_commas to too_few_fields
NickCrews Apr 12, 2023
827f903
test: rename case 1550126 to slash_form
NickCrews Apr 13, 2023
6005a40
test: ref: add helper assert method to better show errors
NickCrews Apr 13, 2023
7f5d80b
test: fix slash_form test
NickCrews Apr 13, 2023
0d70f9c
add types to client.py
NickCrews Apr 13, 2023
dc72c90
feat: Fail earlier on bad output_directory in parse_as_files()
NickCrews Apr 13, 2023
70b3431
docs: improve docstrings of python API
NickCrews Apr 13, 2023
4618d74
test: better error message on fail
NickCrews Apr 13, 2023
5c4c487
ref: improve comparison between numFieldsRead and schema.numFields
NickCrews Apr 13, 2023
a244b38
ref: clean up newline handling in ctxWarn()
NickCrews Apr 13, 2023
56b954c
fix: Don't warn on an empty float field
NickCrews Apr 13, 2023
5e0cd77
feat: put quotes around strings in warnings.
NickCrews Apr 13, 2023
8073321
BREAKING: feat: print correct number of fields be default
NickCrews Apr 13, 2023
b7eb213
Update VERSION
NickCrews Jul 29, 2023
4 changes: 2 additions & 2 deletions .gitignore
@@ -5,7 +5,7 @@
 bin
 *.test
 *.fec
-!python/tests/fixtures/*.fec
+!python/tests/cases/**/*.fec
 output/
 env/
 zig-cache
@@ -38,4 +38,4 @@ share/python-wheels/
 .installed.cfg
 *.egg
 *.whl
-MANIFEST
+MANIFEST
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-0.1.9
+0.1.10
19 changes: 17 additions & 2 deletions build.zig
@@ -85,11 +85,15 @@ pub fn build(b: *std.build.Builder) !void {

const libSources = [_][]const u8{
"src/buffer.c",
"src/mappings.c",
"src/memory.c",
"src/path.c",
"src/encoding.c",
"src/csv.c",
"src/writer.c",
"src/fec.c",
"src/regex.c",
"src/string_utils.c",
};
const pcreSources = [_][]const u8{
"src/pcre/pcre_chartables.c",
@@ -114,12 +118,23 @@ const pcreSources = [_][]const u8{
"src/pcre/pcre_version.c",
"src/pcre/pcre_xclass.c",
};
const tests = [_][]const u8{ "src/buffer_test.c", "src/csv_test.c", "src/writer_test.c", "src/cli_test.c" };
const testIncludes = [_][]const u8{ "src/buffer.c", "src/memory.c", "src/encoding.c", "src/csv.c", "src/writer.c", "src/cli.c" };
const tests = [_][]const u8{
Author comment: Actually, you should just ignore this PR, as I discovered that this parsing is never even used, and I removed it in 22fcc2c.

"src/buffer_test.c",
"src/csv_test.c",
"src/writer_test.c",
"src/cli_test.c",
"src/fec_test.c",
};
const testIncludes = libSources ++ [_][]const u8{
"src/cli.c",
};
const buildOptions = [_][]const u8{
"-std=c11",
"-pedantic",
"-Wall",
"-W",
"-Wno-missing-field-initializers",
// The string literals in mappings_generated.h are super long, which gives us
// warnings, but this isn't actually a problem unless we used some ancient compiler.
"-Wno-overlength-strings",
};
210 changes: 124 additions & 86 deletions python/src/fastfec/client.py
@@ -1,17 +1,18 @@
"""
A Python library to interface with FastFEC.
"""A Python library to interface with FastFEC.

This library provides methods to
* parse a .fec file line by line, yielding a parsed result
* parse a .fec file into parsed output .csv files
"""
from __future__ import annotations

import contextlib
import os
import io
import pathlib
from ctypes import CDLL, c_char_p, c_int, c_void_p
from queue import Queue
from threading import Thread
from typing import Any, Generator

from .utils import (
BUFFER_READ,
@@ -27,29 +28,39 @@


class LibFastFEC:
"""
Python wrapper for the fastfec library
"""
"""Python wrapper for the fastfec library."""

def __init__(self):
def __init__(self) -> None:
self.__init_lib()

# Initialize
self.persistent_memory_context = self.libfastfec.newPersistentMemoryContext()

def parse(self, file_handle, include_filing_id=None, should_parse_date=True):
"""
Parses the input file line-by-line
def parse(
self,
file_handle: io.BinaryIO,
include_filing_id: str | None = None,
should_parse_date: bool = True,
raw: bool = False,
) -> Generator[tuple[str, dict[str, Any]], None, None]:
"""Parses the input file line-by-line.

Arguments:
---------
file_handle -- An input stream for reading a .fec file
include_filing_id -- If set, prepend a column into each outputted csv for filing_id
with the specified filing id (defaults to None)
should_parse_date -- If true, yields parsed datetime.date objects for date fields; if
false, yields strings for date fields. This would mainly be set to
false for performance reasons (defaults to true)
include_filing_id -- If set, prepend a column into each outputted csv
for filing_id with the specified filing id.
should_parse_date -- If True, date fields are parsed to datetime.date.
If False, date fields are returned as raw YYYY-MM-DD
strings. This would mainly be set to false for
performance reasons.
raw -- If True, if there are fewer or more fields in a row than we
expect, the row will be written to the output file as-is.
If False, we will add empty fields, or skip extra fields,
to the row to make it the correct length.

Returns:
-------
A generator that receives the form name and a dictionary
object describing each line in the file
"""
@@ -58,26 +69,28 @@ def parse(self, file_handle, include_filing_id=None, should_parse_date=True):
done_processing = object() # A custom object to signal the end of processing

# Prepare the filing id to include, if specified
include_filing_id = as_bytes(include_filing_id)
filing_id_included = include_filing_id is not None
filing_id = as_bytes(include_filing_id)
filing_id_included = filing_id is not None

# Provide a custom line callback
buffer_read_fn = provide_read_callback(file_handle)
line_callback_fn = CUSTOM_LINE(provide_line_callback(queue, filing_id_included, should_parse_date))
line_callback_fn = CUSTOM_LINE(
provide_line_callback(queue, filing_id_included, should_parse_date),
)
fec_context = self.libfastfec.newFecContext(
self.persistent_memory_context,
buffer_read_fn,
BUFFER_SIZE,
CUSTOM_WRITE(0),
BUFFER_SIZE,
line_callback_fn,
0,
None,
include_filing_id,
None,
filing_id_included,
1,
0,
self.persistent_memory_context, # persistentMemory
buffer_read_fn, # bufferRead
BUFFER_SIZE, # inputBufferSize
CUSTOM_WRITE(0), # customWriteFunction
BUFFER_SIZE, # outputBufferSize
line_callback_fn, # customLineFunction
0, # writeToFile
None, # file
None, # outputDirectory
filing_id, # filingId
1, # silent
0, # warn
raw, # raw
)

# Run the parsing in a separate thread. It's essentially still single-threaded
@@ -87,7 +100,7 @@ def task():
self.libfastfec.parseFec(fec_context)
queue.put(done_processing) # Signal processing is over

Thread(target=task, args=()).start()
Thread(target=task, args=(), daemon=True).start()

# Yield processed lines
while True:
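The pattern used by `parse()` in this hunk (a worker thread that feeds a queue, a sentinel object that marks the end, and a thread now started with `daemon=True`) can be sketched in isolation. This is an illustration only, not the FastFEC code: `iter_from_worker` and `produce` are hypothetical names, and the `try/finally` around the producer is an addition that the diff does not show.

```python
from queue import Queue
from threading import Thread
from typing import Any, Callable, Iterator


def iter_from_worker(produce: Callable[[Callable[[Any], None]], None]) -> Iterator[Any]:
    """Hypothetical helper: run `produce` in a thread and yield what it emits."""
    queue: Queue = Queue()
    done = object()  # Unique sentinel; no emitted item can be identical to it.

    def task() -> None:
        try:
            produce(queue.put)  # The producer calls queue.put(item) for each item.
        finally:
            queue.put(done)  # Always signal the end, even if the producer raised.

    # daemon=True: the thread will not keep the interpreter alive at exit.
    Thread(target=task, args=(), daemon=True).start()

    while True:
        item = queue.get()
        if item is done:
            break
        yield item
```

For example, `list(iter_from_worker(lambda emit: [emit(i) for i in range(3)]))` collects the three emitted integers in order. The practical effect of the daemon flag is that a consumer who stops iterating early no longer leaves a non-daemon thread blocking interpreter shutdown.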
@@ -101,69 +114,97 @@ def task():
# Free FEC context
self.libfastfec.freeFecContext(fec_context)

def parse_as_files(self, file_handle, output_directory, include_filing_id=None):
"""
Parses the input file into output files in the output directory
def parse_as_files(
self,
file_handle: io.BinaryIO,
output_directory: str | pathlib.Path,
include_filing_id: str | None = None,
raw: bool = False,
) -> int:
"""Parses the input file into output files in the output directory.

Parent directories will be automatically created as needed.

Arguments:
---------
file_handle -- An input stream for reading a .fec file
output_directory -- A directory in which to place output parsed .csv files
include_filing_id -- If set, prepend a column into each outputted csv for filing_id
with the specified filing id (defaults to None)
include_filing_id -- If set, prepend a column `filing_id` into each
outputted csv filled with the specified value.
raw -- If True, if there are fewer or more fields in a row than we
expect, the row will be written to the output file as-is.
If False, we will add empty fields, or skip extra fields,
to the row to make it the correct length.

Returns:
-------
A status code. 1 indicates a successful parse, 0 an unsuccessful one.
"""
out_path = pathlib.Path(output_directory)

# Custom open method
def open_output_file(filename, *args, **kwargs):
filename = os.path.join(output_directory, filename)
output_file = pathlib.Path(filename)
output_file.parent.mkdir(exist_ok=True, parents=True)
def open_output_file(form_type: str, *args, **kwargs):
form_type = form_type.replace("/", "-")
path = out_path / form_type
path.parent.mkdir(exist_ok=True, parents=True)
# pylint: disable=consider-using-with,unspecified-encoding,bad-option-value
return open(filename, *args, **kwargs)
return open(path, *args, **kwargs)

return self.parse_as_files_custom(file_handle, open_output_file, include_filing_id=include_filing_id)
return self.parse_as_files_custom(
file_handle,
open_output_file,
include_filing_id=include_filing_id,
raw=raw,
)
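The path handling in the new `open_output_file()` above (replace `/` in the form type, join it onto the output directory, create parent directories) can be tried on its own. The sketch below assumes nothing about FastFEC beyond what the diff shows; `output_path_for`, `demo`, and the form type `"F3/X"` are made-up names and values for illustration.

```python
from __future__ import annotations

import pathlib
import tempfile


def output_path_for(output_directory: str | pathlib.Path, form_type: str) -> pathlib.Path:
    """Hypothetical helper mirroring the path logic in open_output_file()."""
    # A "/" in the form type would otherwise be read as a directory separator.
    safe_name = form_type.replace("/", "-")
    path = pathlib.Path(output_directory) / safe_name
    # Create the output directory (and any missing parents) before opening.
    path.parent.mkdir(exist_ok=True, parents=True)
    return path


def demo() -> tuple[str, bool]:
    """Build a path under a temporary directory and report what was created."""
    with tempfile.TemporaryDirectory() as tmp:
        path = output_path_for(pathlib.Path(tmp) / "out" / "nested", "F3/X")
        return path.name, path.parent.is_dir()
```

Without the `replace()` call, a form type containing a slash would be written into an unintended subdirectory rather than a single file in `output_directory`.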

def parse_as_files_custom(self, file_handle, open_function, include_filing_id=None):
"""
Parses the input file into output files
def parse_as_files_custom(
self,
file_handle: io.BinaryIO,
open_function,
include_filing_id: str | None = None,
raw: bool = False,
) -> int:
"""Parses the input file into output files.

Arguments:
---------
file_handle -- An input stream for reading a .fec file
open_function -- A function to open an output file for writing. This can be set to
customize the output stream for each parsed .csv file
include_filing_id -- If set, prepend a column into each outputted csv for filing_id
with the specified filing id (defaults to None)
open_function -- A function to open an output file for writing, given
a form type. This can be set to customize the output
stream for each parsed .csv file
include_filing_id -- If set, prepend a column `filing_id` into each
outputted csv filled with the specified value.
raw -- If True, if there are fewer or more fields in a row than we
expect, the row will be written to the output file as-is.
If False, we will add empty fields, or skip extra fields,
to the row to make it the correct length.

Returns:
-------
A status code. 1 indicates a successful parse, 0 an unsuccessful one.
"""
# Set callbacks
buffer_read_fn = provide_read_callback(file_handle)
write_callback_fn, free_file_descriptors = provide_write_callback(open_function)

# Prepare the filing id to include, if specified
include_filing_id = as_bytes(include_filing_id)
filing_id_included = include_filing_id is not None
filing_id = as_bytes(include_filing_id)

# Initialize fastfec context
fec_context = self.libfastfec.newFecContext(
self.persistent_memory_context,
buffer_read_fn,
BUFFER_SIZE,
write_callback_fn,
BUFFER_SIZE,
CUSTOM_LINE(0),
0,
None,
include_filing_id,
None,
filing_id_included,
1,
0,
self.persistent_memory_context, # persistentMemory
buffer_read_fn, # bufferRead
BUFFER_SIZE, # inputBufferSize
write_callback_fn, # customWriteFunction
BUFFER_SIZE, # outputBufferSize
CUSTOM_LINE(0), # customLineFunction
0, # writeToFile
None, # file
None, # outputDirectory
filing_id, # filingId
1, # silent
0, # warn
raw, # raw
)

# Parse
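On the `raw` flag that this hunk passes through to `newFecContext()`: the docstrings above describe it as either writing a row as-is or padding and truncating it to the expected length. The actual logic lives in the C library and is not part of this diff, so the following is only a minimal Python sketch of the documented behavior. `normalize_row` and its parameters are hypothetical names, not part of the FastFEC API.

```python
from __future__ import annotations


def normalize_row(fields: list[str], expected: int, raw: bool = False) -> list[str]:
    """Hypothetical helper: pad or truncate a row to `expected` fields."""
    if raw:
        # raw=True: pass the row through untouched, whatever its length.
        return list(fields)
    if len(fields) < expected:
        # Too few fields: append empty strings until the row is long enough.
        return list(fields) + [""] * (expected - len(fields))
    # Too many (or exactly enough) fields: keep only the first `expected`.
    return list(fields[:expected])
```

With `raw=False`, which is the default in the new signatures, every row comes out with exactly `expected` fields, so downstream CSV readers see a consistent column count.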
@@ -175,13 +216,11 @@ def parse_as_files_custom(self, file_handle, open_function, include_filing_id=No

return result

def free(self):
"""
Frees all the allocated memory from the fastfec library
"""
def free(self) -> None:
"""Frees all the allocated memory from the fastfec library."""
self.libfastfec.freePersistentMemoryContext(self.persistent_memory_context)

def __init_lib(self):
def __init_lib(self) -> None:
# Find the fastfec library
self.libfastfec = CDLL(find_fastfec_lib())

@@ -190,19 +229,19 @@ def __init_lib(self):
self.libfastfec.newPersistentMemoryContext.restype = c_void_p

self.libfastfec.newFecContext.argtypes = [
c_void_p,
BUFFER_READ,
c_int,
CUSTOM_WRITE,
c_int,
CUSTOM_LINE,
c_int,
c_void_p,
c_char_p,
c_char_p,
c_int,
c_int,
c_int,
c_void_p, # persistentMemory
BUFFER_READ, # bufferRead
c_int, # inputBufferSize
CUSTOM_WRITE, # customWriteFunction
c_int, # outputBufferSize
CUSTOM_LINE, # customLineFunction
c_int, # writeToFile
c_void_p, # file
c_char_p, # outputDirectory
c_char_p, # filingId
c_int, # silent
c_int, # warn
c_int, # raw
]
self.libfastfec.newFecContext.restype = c_void_p
self.libfastfec.parseFec.argtypes = [c_void_p]
@@ -212,9 +251,8 @@ def __init_lib(self):


@contextlib.contextmanager
def FastFEC(): # pylint: disable=invalid-name
"""
A convenience method to run fastfec and free memory afterwards
def FastFEC() -> Generator[LibFastFEC, None, None]: # pylint: disable=invalid-name
"""A convenience method to run fastfec and free memory afterwards.

Usage:

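The body of `FastFEC()` is collapsed in this view, so the following is a guess at the general shape of a "run, then free memory afterwards" context manager rather than the actual implementation. `FakeLib` is a stand-in for `LibFastFEC`, `managed_lib` is a hypothetical name, and the `try/finally` is an assumption.

```python
import contextlib
from typing import Generator


class FakeLib:
    """Stand-in for LibFastFEC; it only records whether free() was called."""

    def __init__(self) -> None:
        self.freed = False

    def free(self) -> None:
        self.freed = True


@contextlib.contextmanager
def managed_lib() -> Generator[FakeLib, None, None]:
    """Yield a library wrapper and free it when the with-block exits."""
    lib = FakeLib()
    try:
        yield lib
    finally:
        # Runs on normal exit and when the with-block raises.
        lib.free()
```

The return annotation `Generator[FakeLib, None, None]` follows the same convention as the `Generator[LibFastFEC, None, None]` annotation added in this PR: the function itself is a generator, and the decorator turns it into a context manager.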