Accessing Remote Data with a Generalized File System


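We have developed a generic file system implementation and specification, originally created to meet the needs of Dask, that gives all users simple access to many local, cluster, and remote storage media. Dask and Intake have both switched to this new package: fsspec.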

Introduction

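Here we are concerned with the low-level business of getting raw bytes from some location. We are used to doing this on a local disk, but communicating with other storage mechanisms can be tricky, and is different in every case. Consider, for example, the different ways of reading a file from Hadoop, from a server for which you have SSH credentials, or from a cloud storage service such as Amazon S3. Because these problems are central to working with big data, we developed code to complement Dask for exactly this job, and released packages such as s3fs and gcsfs.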

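We found that these packages, which are built and released independently, were popular even without Dask, in part because they are used by other PyData libraries such as pandas and xarray. We therefore realized that the general idea of handling arbitrary file systems, along with useful code for mapping URLs to bytes, should not be buried in Dask, but should be made public and available to everyone, even those with no interest in parallel or out-of-core computing.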

Examples

Consider the following lines of code (which use s3fs):

>>> import fsspec
>>> of = fsspec.open("s3://anaconda-public-datasets/iris/iris.csv", mode='rt', anon=True)
>>> with of as f:
...     print(f.readline())
...     print(f.readline())
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa

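This parses a URL and initiates a session to talk with AWS S3, reading a particular key in text mode. Note that we specify an anonymous connection (for those without S3 credentials, since this data is public). The OpenFile object, of, is serializable and communicates with the remote service only inside a context; f, on the other hand, is a regular file-like object that can be passed to the many Python functions that expect methods such as readline(). The output is two lines of data from the famous Iris Dataset.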
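To make the point about serializability concrete, here is a minimal sketch. It assumes a small local file as a stand-in for the S3 object (the file name and contents are made up for illustration), so that it runs without network access or credentials.

```python
import os
import pickle
import tempfile

import fsspec

# Assumption: a small local file stands in for the remote S3 object.
path = os.path.join(tempfile.mkdtemp(), "iris_sample.csv")
with open(path, "w") as f:
    f.write("5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n")

# Creating the OpenFile does not open the underlying file yet.
of = fsspec.open(path, mode="rt")

# The OpenFile survives a pickle round trip, so it could be sent to a worker.
restored = pickle.loads(pickle.dumps(of))

# Only inside the context is a real file-like object created.
with restored as f:
    first_line = f.readline()

print(first_line.strip())
```

Because the actual file handle exists only inside the with block, the object that travels between processes is just a lightweight description of what to open.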

This file is stored uncompressed, and can be opened in random-access bytes mode.

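Random access makes it possible to seek to and extract a small part of a potentially large file without downloading the whole thing. When working with big data, this is useful both for local data exploration and for parallel processing in the cloud (which is exactly how Dask uses fsspec).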
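A minimal sketch of random-access reading, assuming a small local binary file in place of a large remote one (the file name and byte pattern are invented for the example). The same seek() and read() calls apply to a file opened from a remote backend.

```python
import os
import tempfile

import fsspec

# Assumption: a 2048-byte local file stands in for a large remote object.
data = bytes(range(256)) * 8
path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as f:
    f.write(data)

# Open in bytes mode, jump to an offset, and read only a small slice.
with fsspec.open(path, mode="rb") as f:
    f.seek(1000)
    chunk = f.read(20)
    position = f.tell()

print(len(chunk), position)
```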

Now compare with the following:

>>> of = fsspec.open("https://datahub.io/machine-learning/iris/r/iris.csv")
>>> with of as f:
...     print(f.readline())
...     print(f.readline())
...     print(f.readline())
sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa

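This uses a different backend for an HTTP location, but has exactly the same API. (The output differs slightly because this version of the dataset includes a header row.)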
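The "same API, different backend" idea can be sketched offline as well. The example below assumes two stand-in backends, the local file system and fsspec's in-memory file system, and a hypothetical helper named first_lines; neither the helper nor the sample data comes from the original post.

```python
import os
import tempfile

import fsspec


def first_lines(url, n=2):
    """Hypothetical helper: return the first n lines of a text file at any fsspec URL."""
    with fsspec.open(url, mode="rt") as f:
        return [f.readline().rstrip("\n") for _ in range(n)]


content = "sepallength,sepalwidth\n5.1,3.5\n4.9,3.0\n"

# Backend 1: a plain local file.
local_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(local_path, "w") as f:
    f.write(content)

# Backend 2: the in-memory file system, addressed by a memory:// URL.
with fsspec.open("memory://same-api-demo/demo.csv", mode="wt") as f:
    f.write(content)

# The helper does not know or care which backend serves the bytes.
local_result = first_lines(local_path)
memory_result = first_lines("memory://same-api-demo/demo.csv")
print(local_result == memory_result)
```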

Remote File Systems

Alternatively, you can work with file system instances, which have all the methods you would expect, inspired by the built-in os module:

>>> fs = fsspec.filesystem('s3', anon=True)
>>> fs.ls('anaconda-public-datasets')
['anaconda-public-datasets/enron-email',
 'anaconda-public-datasets/fashion-mnist',
 'anaconda-public-datasets/gdelt',
 'anaconda-public-datasets/iris',
 'anaconda-public-datasets/nyc-taxi',
 'anaconda-public-datasets/reddit']
>>> fs.info("anaconda-public-datasets/iris/iris.csv")
{'Key': 'anaconda-public-datasets/iris/iris.csv',
 'LastModified': datetime.datetime(2017, 8, 10, 16, 35, 36, tzinfo=tzutc()),
 'ETag': '"f47788bbfca239ad319aa7a3b038fc71"',
 'Size': 4700,
 'StorageClass': 'STANDARD',
 'type': 'file',
 'size': 4700,
 'name': 'anaconda-public-datasets/iris/iris.csv'}
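For readers without S3 access, the same instance methods can be tried against fsspec's in-memory backend. This is only a sketch: the "demo-bucket" path and the file contents are invented, and the exact keys returned by info() beyond name, size, and type vary by backend.

```python
import fsspec

# Assumption: the in-memory backend stands in for S3.
fs = fsspec.filesystem("memory")

# Write some bytes to a made-up path, then inspect it like any other file system.
payload = b"5.1,3.5,1.4,0.2,Iris-setosa\n"
fs.pipe("/demo-bucket/iris/iris.csv", payload)

listing = fs.ls("/demo-bucket/iris", detail=False)
details = fs.info("/demo-bucket/iris/iris.csv")

print(listing)
print(details["type"], details["size"])
```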

The point is that you do (almost) exactly the same thing for any of several backend file systems, and you get extra functionality for free:

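  • transparent decompression and text mode via the open() and open_files() functions, the latter of which automatically expands glob strings
  • a key-value dictionary view of a directory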
>>> m = fsspec.get_mapper('s3://zarr-demo/store', anon=True)
>>> list(m)
['.zattrs',
 '.zgroup',
 'foo/.zattrs',
...]
>>> m['.zattrs']
b'{}'
  • transactional writes: all files are finalized only when the context ends, and are rolled back/discarded if an exception occurs
>>> fs = fsspec.filesystem('s3')  # requires credentials
>>> with fs.transaction:
...     fs.put('localfile', 'mybucket/remotefile')
...     raise RuntimeError
>>> assert not fs.exists('mybucket/remotefile')
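The first item in the list above, transparent decompression combined with glob expansion, can be sketched as follows. The example assumes the in-memory backend and three invented gzip-compressed files; compression="infer" asks fsspec to pick the codec from the file extension.

```python
import gzip

import fsspec

# Assumption: three small gzip files in the in-memory backend stand in for remote data.
fs = fsspec.filesystem("memory")
for i in range(3):
    fs.pipe(f"/compressed-demo/part-{i}.csv.gz", gzip.compress(f"part {i}\n".encode()))

# open_files() expands the glob; each result is an OpenFile that decompresses on read.
files = fsspec.open_files(
    "memory://compressed-demo/*.csv.gz", mode="rt", compression="infer"
)

lines = []
for of in files:
    with of as f:
        lines.append(f.read().strip())

print(len(files), sorted(lines))
```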

Unified Interface

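Intake now relies on fsspec for its file handling. Intake's purpose is to simplify finding and loading data, so being able to browse any file system (or anything that can be thought of as a file system) is important. In the main GUI, you can now choose among many possible backends, not just local files.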

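This lets you select catalogs that may be remote, and have them load data that is also remote. Of course, you still need the relevant drivers installed, and you do need access to a service matching the protocol (the kwargs box exists for adding any other parameters the backend may need). For example, in the following image, I can browse S3 and see all the buckets I own as "directories". This needs no extra configuration, because my S3 credentials are stored on the system.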

Specification

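A subtler point is that much of the logic for handling files is common across implementations. The fsspec package contains a specification from which other file system implementations can derive, making it much simpler to write a new file system wrapper that is compatible with Dask, Intake, and other tools. Such implementations also inherit many features for free.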
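As a rough illustration of deriving from the specification, the sketch below defines a toy read-only file system backed by a Python dict. The class name DictFileSystem, the "dictfs" protocol, and the flat (directory-free) namespace are all assumptions made for this example; a real backend would implement more methods and handle directories, writing, and errors.

```python
import io

from fsspec.spec import AbstractFileSystem


class DictFileSystem(AbstractFileSystem):
    """Hypothetical read-only file system over a flat dict of name -> bytes."""

    protocol = "dictfs"  # made-up protocol name for this sketch
    cachable = False  # skip fsspec's instance cache for this toy class

    def __init__(self, files, **kwargs):
        super().__init__(**kwargs)
        self.files = dict(files)

    def ls(self, path, detail=True, **kwargs):
        # Flat namespace: every file is listed at the root, whatever the path.
        entries = [
            {"name": name, "size": len(data), "type": "file"}
            for name, data in sorted(self.files.items())
        ]
        return entries if detail else [e["name"] for e in entries]

    def _open(self, path, mode="rb", **kwargs):
        # The base class calls this hook from open(); only reading is sketched.
        if mode != "rb":
            raise NotImplementedError("this sketch is read-only")
        return io.BytesIO(self.files[path])


fs = DictFileSystem({"greeting.txt": b"hello\n"})
names = fs.ls("", detail=False)

# Text mode is one of the features inherited from the base class.
with fs.open("greeting.txt", "rt") as f:
    text = f.read()

print(names, repr(text))
```

Only ls() and _open() are written by hand here; the text-mode wrapping in open() comes from the base class, which is the kind of inherited functionality the paragraph above describes.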

I therefore invite all interested developers to get in touch with us to see how you might implement your favorite file system.
