`universal_transfer_operator.datasets.file.base`

Module Contents

Classes

File

Represents all file datasets.

class universal_transfer_operator.datasets.file.base.File

Bases: universal_transfer_operator.datasets.base.Dataset

Represents all file datasets.

Parameters:

path – Path to a file in the filesystem/Object stores
conn_id – Airflow connection ID
filetype – constant to provide an explicit file type
normalize_config – parameters in dict format of pandas json_normalize() function.
is_bytes – whether the file content is handled as bytes (defaults to False)

property location

property size: int

Return the size in bytes of the given file.

Returns:: File size in bytes
Return type:: int

property type: universal_transfer_operator.datasets.file.types.base.FileTypes

Return type:: universal_transfer_operator.datasets.file.types.base.FileTypes

path: str

conn_id: str

filetype: FileTypeConstant | None

normalize_config: dict | None

is_bytes: bool = False

uri: str

extra: dict

is_dataframe: bool = False

is_binary()

Return whether the file is in a binary format. Uses a naive strategy based on the file extension.

Returns:: True or False
Return type:: bool
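An extension-based check of this kind could be sketched as follows. This is a minimal illustration under assumptions, not the library's implementation: `looks_binary` and `BINARY_EXTENSIONS` are hypothetical names, and which extensions count as binary is an assumption made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions treated as binary formats (an assumption).
BINARY_EXTENSIONS = {".parquet"}


def looks_binary(path: str) -> bool:
    """Naive check: decide from the file extension alone, without reading the file."""
    return PurePosixPath(path).suffix.lower() in BINARY_EXTENSIONS


print(looks_binary("s3://bucket/data/sales.parquet"))
print(looks_binary("/tmp/sales.csv"))
```

Because only the extension is inspected, a misnamed file is classified by its name, not by its content.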

is_pattern()

Returns True when the file path is a pattern (e.g. s3://bucket/folder or /folder/sample_*).

Returns:: True or False
Return type:: bool
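One heuristic consistent with both examples above is to treat a path as a pattern when it contains glob characters or has no file extension. The sketch below illustrates that idea only. `looks_like_pattern` and `GLOB_CHARS` are hypothetical names, and the rule itself is an assumption, not a statement of how the library decides.

```python
from pathlib import PurePosixPath

# Characters that commonly mark a glob expression (an assumption for this sketch).
GLOB_CHARS = set("*?[")


def looks_like_pattern(path: str) -> bool:
    """Heuristic: glob characters or a missing file extension suggest a pattern."""
    if any(ch in GLOB_CHARS for ch in path):
        return True
    # A path without an extension is assumed to name a folder or prefix.
    return PurePosixPath(path).suffix == ""


for candidate in ("s3://bucket/folder", "/folder/sample_*", "/folder/sample.csv"):
    print(candidate, looks_like_pattern(candidate))
```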

create_from_dataframe(df, store_as_dataframe=True)

Create a file in the desired location using the values of a dataframe.

Parameters:

df (pandas.DataFrame) – pandas dataframe
store_as_dataframe (bool) – Whether the data should later be deserialized as a dataframe or as a file containing delimited data (e.g. csv, parquet, etc.).

Return type:

None

export_to_dataframe(**kwargs)

Read files from all supported locations and convert them into dataframes.

Return type:: pandas.DataFrame

export_to_dataframe_via_byte_stream(**kwargs)

Read files from all supported locations and convert them into dataframes. Due to noted issues with using smart_open with pandas (like https://github.com/RaRe-Technologies/smart_open/issues/524), we create a BytesIO or StringIO buffer before exporting to a dataframe. We’ve found a sizable speed improvement with this optimization.

Return type:: pandas.DataFrame
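The buffering idea described above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the stream is read once into memory and the parser then works on the in-memory buffer. The `csv` module stands in for pandas here so the example has no third-party dependencies, and `read_rows_via_buffer` is a hypothetical helper name.

```python
import csv
import io
from typing import BinaryIO


def read_rows_via_buffer(stream: BinaryIO, encoding: str = "utf-8") -> list[dict[str, str]]:
    """Copy the whole stream into memory first, then parse from the in-memory buffer."""
    # One bulk read from the (possibly slow, remote) stream.
    raw = stream.read()
    # Text formats are parsed from a StringIO buffer; a binary format
    # would use io.BytesIO(raw) instead.
    text_buffer = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text_buffer))


# An in-memory stream stands in for a remote file handle in this sketch.
remote_like_stream = io.BytesIO(b"id,name\n1,alpha\n2,beta\n")
rows = read_rows_via_buffer(remote_like_stream)
print(rows)
```

pandas readers such as `pandas.read_csv` accept file-like objects, so a buffer built this way can be passed to them in place of the `csv` parser used here. The trade-off is that the whole file must fit in memory.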

exists()

Check whether the file exists.

Return type:: bool

classmethod from_json(serialized_object)

Parameters:: serialized_object (dict)
