20 More Hot Data Tools and What They DON'T Do

In the past few months, the data ecosystem has continued to burgeon as some parts of the stack consolidate and as new challenges arise. Our first attempt to help stakeholders navigate this ecosystem highlighted 25 Hot New Data Tools and What They DON’T Do — clarifying specific problems the featured companies and projects did and did NOT solve.

This effort was positively received by the data science, engineering and analytics communities, and spurred more engagement than we originally anticipated. Further, we were flattered to see the original post motivate other thought-provoking pieces such as 20 Hot New Data Tools and their Early Go-to-Market Strategies.

Taking it Further

Regardless, we quickly recognized our original post did not go far enough as we received dozens of emails, Twitter messages and Slack DMs about other solutions that were not covered. We had shed light on a small corner of the expanding universe of data tools and platforms, yet there was an opportunity to cover even more.

Although we cannot chronicle every additional data tool in just one follow-up post, here we continue our efforts to cultivate this ecosystem by highlighting a few more. The creators of these tools not only occupy meaningful parts of the ever-evolving modern data stack, but also graciously responded to our requests to help us understand where they fit in.

They sound off here in their own words.

More Tools and Responses

  1. Shipyard: Shipyard is a workflow orchestration platform that helps teams quickly launch, monitor, and share data solutions without worrying about infrastructure management. It lets users create reusable blueprints, share data seamlessly between jobs, and run code without any proprietary setup, all while scaling resources dynamically. Shipyard is NOT a no-code tool and does not support data versioning or data visualization.

  2. Count: Count is a data notebook that replaces dashboards for reporting and self-service, and supports data transformation. Count is uniquely good at team collaboration, enabling technical and non-technical users to work within the same notebook. Count is NOT a data science notebook.

  3. Castor: Castor is uniquely good at organizing information about data to support data discovery, GDPR compliance, and knowledge management. Through a plug-and-play solution, Castor builds a comprehensive and actionable map of all data assets. Castor is NOT a data visualization or BI tool.

  4. Census: Census is uniquely good at syncing data models from a warehouse to business tools like Salesforce. It complements existing warehouses, data loaders & transform tools to enable data teams to drive business operations. It is NOT a no-code tool nor does it automagically model your data; it relies on analysts writing models in SQL.

  5. Iteratively: Iteratively is a schema registry that helps teams collaborate to define, instrument, and validate their analytics. With Iteratively, you can ship high-quality analytics faster and prevent common data quality & privacy issues that undermine trust. Iteratively is NOT a BI tool, data pipeline, or transformation tool.

  6. StreamSQL: StreamSQL handles deploying, versioning, and sharing model features. Using your definitions, it generates features for both serving and training. Its registry facilitates re-using features across teams and models. StreamSQL does NOT do model management and is completely agnostic to what you do with the features once you get them.

  7. Xplenty: Xplenty is a cloud-based ETL solution providing simple visualized data pipelines for automated data flows across a wide range of sources and destinations. It is uniquely good at ingesting large volumes of data, performing code-free data transformations, and scheduling workflows. Xplenty does NOT do event streaming.

  8. Vectice: Vectice is uniquely good at tracking, documenting, and organizing all AI assets (e.g., datasets, features, models, experiments, dashboards, notebooks) and the underlying domain knowledge needed to successfully manage and scale enterprise AI initiatives. Vectice does NOT provide any runtime or computational environment.

  9. Snowplow Analytics: Snowplow is a streaming behavioral data engine that is uniquely good at generating event data from dedicated web/mobile/server SDKs, enhancing that data and delivering it to your data warehouse. Snowplow is NOT a data integration (ELT) tool, nor a general streaming framework, nor a BI tool.

  10. Datafold: Datafold is uniquely good at comparing datasets in a SQL data warehouse or across data warehouses. It enables running “git diff” on a table of any size. Datafold is NOT a database itself (it works on top of existing infrastructure) and it does NOT work with files.

  11. Splitgraph: Splitgraph is a tool for building, extending, versioning, and sharing SQL databases that is uniquely good at enhancing existing tools. Splitgraph also features a data catalogue including 40K open datasets that can be queried (and joined) with any SQL client. Splitgraph is NOT a database.

  12. Datacoral: Datacoral is uniquely good at automatically generating data ingestion and transformation pipelines from SQL-based declarative specifications, and automatically capturing and displaying schema level lineage. Datacoral plays nice with data ingestion tools like Segment, and workflow management tools like Airflow. Datacoral is NOT a data warehouse or a query engine.

  13. Apache Arrow: Apache Arrow is uniquely good as a language-independent standard for fast in-memory analytical processing and efficient interprocess transport (with minimal overhead) of large tabular datasets. While intended as a computational foundation for data frame projects, it is NOT a replacement for end-user facing tools like pandas.

  14. Datasaur: Datasaur is built to support NLP labeling via ML-assisted suggestions. It supports workforce management, maintains data privacy, and can be integrated via API to any ML workflow. Datasaur does NOT handle bounding boxes for image/video labeling.

  15. Datakin: Datakin is a DataOps solution that helps guarantee that data pipelines run without disruption and resulting data can be trusted. It does so by automatically discovering data lineage and providing tools to quickly identify and resolve issues. Datakin is NOT a data catalog nor does it replace any existing data infrastructure components (workflow orchestration, data processing, …).

  16. ApertureData: ApertureData is a database for visual data like images, videos, feature vectors, and associated metadata like annotations. It natively supports complex searching and preprocessing operations over media objects, and integrates with cloud-based storage and ML frameworks like PyTorch/TensorFlow. ApertureData does NOT extract metadata or features from images/videos.

  17. Orchest: Orchest is uniquely good at assisting data scientists in interactively building data science pipelines by providing a visual pipeline editing environment in the browser. Pipeline steps are containerized notebooks or scripts. Orchest does NOT replace Jupyter notebooks, provide a no-code tool, or bring its own computational infrastructure.

  18. Gazette: Gazette is an open source streaming platform that breaks down the divide between batch and real-time data, enabling users to build real-time applications with exactly-once semantics. It offers real-time message streams, which are natively and durably stored as regular files in cloud storage. Gazette is NOT an ETL tool or an analytics platform.

  19. Coiled Computing: Coiled excels at scaling data science and machine learning workflows in native Python using Dask, which is familiar, widely adopted, and gives great feedback. Coiled is an opinionated way of bursting to clusters and the cloud while staying in the PyData ecosystem. Coiled/Dask is NOT a database or Kubernetes replacement.

  20. Upsolver: Upsolver is a cloud-native solution for integrating structured and unstructured data on cloud storage. It uses a visual SQL interface for quick and easy data transformation. Upsolver is NOT a Platform as a Service solution that requires developers to write additional code and learn low-level concepts to process data.
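
Several of the tools above are easier to picture with a concrete example. Item 10, for instance, describes running "git diff" on a table. What such a diff reports can be sketched in plain Python. This is only a toy illustration of the idea, not Datafold's implementation or API: `TableDiff`, `diff_tables`, and the sample rows are hypothetical names invented here, and the sketch assumes both tables fit in memory and share a unique key column, which a warehouse-scale diff would not assume.

```python
from dataclasses import dataclass, field


@dataclass
class TableDiff:
    """Row-level differences between two versions of a table (hypothetical type)."""
    added: list = field(default_factory=list)    # keys present only in the new table
    removed: list = field(default_factory=list)  # keys present only in the old table
    changed: dict = field(default_factory=dict)  # key -> {column: (old_value, new_value)}


def diff_tables(old_rows, new_rows, key):
    """Compare two lists of row dicts, matching rows on the `key` column."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    result = TableDiff()
    result.added = sorted(new_by_key.keys() - old_by_key.keys())
    result.removed = sorted(old_by_key.keys() - new_by_key.keys())

    # For rows present on both sides, record every column whose value differs.
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        old, new = old_by_key[k], new_by_key[k]
        columns = sorted(set(old) | set(new))
        delta = {c: (old.get(c), new.get(c)) for c in columns if old.get(c) != new.get(c)}
        if delta:
            result.changed[k] = delta
    return result


# Made-up sample data: the same table before and after a pipeline change.
before = [
    {"id": 1, "plan": "free", "seats": 1},
    {"id": 2, "plan": "pro", "seats": 5},
    {"id": 3, "plan": "pro", "seats": 2},
]
after = [
    {"id": 1, "plan": "free", "seats": 1},
    {"id": 2, "plan": "team", "seats": 5},
    {"id": 4, "plan": "free", "seats": 1},
]

report = diff_tables(before, after, key="id")
print("added:", report.added)
print("removed:", report.removed)
print("changed:", report.changed)
```

The report separates added, removed, and changed rows, which is the same kind of summary a reviewer gets from a code diff, applied to data instead of source files.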

As authors (Sarah, Abe & Pete) we’re collectively brainstorming about how we can extend this effort and create an ever-growing list that helps practitioners find and adopt the right tools, founders align with the best partners, and investors map companies to their investment theses. We look forward to hearing your thoughts on the best medium to continue this exploration with the support of the community.

Translated from: https://towardsdatascience.com/20-more-hot-data-tools-and-what-they-dont-do-46bc365bea74

    Original author: weixin_26746401
    Original URL: https://blog.csdn.net/weixin_26746401/article/details/108499005