Show HN: Desbordante 2.0 – A high-performance data profiler

  • Posted 2 weeks ago by chernishev
  • 67 points
https://github.com/Desbordante/desbordante-core
Hi! We are excited to announce the second release of Desbordante — an open-source, high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.

Unlike existing data profilers, Desbordante focuses on discovering complex patterns in data, which are notoriously hard to extract. Since its inception in 2019, it has become the fastest open-source tool for these tasks. It also offers an array of patterns that have no alternative implementations. With this release, Desbordante supports 17 types of patterns, including various types of functional dependencies, inclusion and order dependencies, fuzzy algebraic constraints, and many others.

Some ways in which Desbordante can be helpful: 1) Scientists who work with large volumes of data can use it for hypothesis generation. 2) Business data owners and business analysts can benefit from hypothesis generation as well as data quality improvement: cleaning errors out of databases, finding and removing inexact duplicates, and so on. 3) Data scientists can use the discovered patterns for feature engineering and for choosing the right direction for ablation studies.

Desbordante solves two types of tasks: Discovery and Validation. The Discovery task identifies all instances of a specified pattern type in a given dataset. The Validation task is different: it checks whether one specified pattern instance holds in a given dataset. It not only returns True or False; when the instance does not hold, it also explains why (e.g. it can list the table rows with conflicting values).
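
To make the difference between the two tasks concrete, here is a minimal pure-Python sketch for one pattern type, functional dependencies. It does not use Desbordante's API or its algorithms: the function names validate_fd and discover_fds and the sample rows are made up for illustration, and the discovery step is a naive brute-force check over single-column dependencies only.

```python
from itertools import permutations


def validate_fd(rows, lhs, rhs):
    """Validation: check whether column `lhs` determines column `rhs`.

    Returns (holds, conflicts). Each conflict is a list of row indices that
    share the same lhs value but disagree on the rhs value.
    """
    groups = {}
    for index, row in enumerate(rows):
        groups.setdefault(row[lhs], []).append(index)
    conflicts = [
        indices for indices in groups.values()
        if len({rows[i][rhs] for i in indices}) > 1
    ]
    return (not conflicts, conflicts)


def discover_fds(rows, columns):
    """Discovery: list every single-column dependency a -> b that holds."""
    return [
        (a, b) for a, b in permutations(columns, 2)
        if validate_fd(rows, a, b)[0]
    ]


# Hypothetical sample data, not taken from any Desbordante example.
rows = [
    {"city": "Paris", "country": "France", "zip": "75001"},
    {"city": "Paris", "country": "France", "zip": "75002"},
    {"city": "Lyon", "country": "France", "zip": "69001"},
    {"city": "Lyon", "country": "Spain", "zip": "69002"},
]

# Validation of one instance: city -> country fails, rows 2 and 3 conflict.
print(validate_fd(rows, "city", "country"))  # (False, [[2, 3]])

# Discovery: every single-column dependency that holds in the table.
print(discover_fds(rows, ["city", "country", "zip"]))
# [('zip', 'city'), ('zip', 'country')]
```

The conflict list is what turns a bare False into something actionable: it points at the exact rows that need to be inspected or cleaned.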

Desbordante offers a CLI, a web application, and a Python library. The latter makes it possible to construct ad-hoc data analysis pipelines — essentially, your own applications for various data quality tasks: data cleaning, data deduplication, anomaly detection, data schema recovery and many others. You can check out example implementations here: https://github.com/Desbordante/desbordante-core/tree/main/ex....

Check out some of our articles for more details:

https://medium.com/@chernishev/exploratory-data-analysis-wit...

https://itnext.io/building-a-simple-data-cleaning-applicatio...

https://levelup.gitconnected.com/checking-mining-and-explori...

This major release brings a lot of improvements: support for several novel patterns, support for a new data type (graphs), Python bindings for existing patterns, better guides and examples, and more. The detailed changelog is available here (https://github.com/Desbordante/desbordante-core/releases/tag...).

4 comments
