Apache Griffin


Apache Griffin is a data quality service platform built on Apache Hadoop and Apache Spark. It provides a framework for defining data quality models, executing data quality measurements, and automating data profiling and validation, as well as unified data quality visualization across multiple data systems. It aims to address data quality challenges in big data and streaming contexts.

Overview of Apache Griffin

When people use big data systems (Hadoop or streaming systems), measuring data quality is a big challenge. Different teams have built customized tools to detect and analyze data quality issues within their own domains. As a platform organization, we want to take a platform approach to these commonly occurring patterns. We are therefore building a platform that provides shared infrastructure and generic features to solve common data quality pain points. This will enable us to build trusted data assets.

Currently it is very difficult and costly to validate data quality when large volumes of related data flow across multiple platforms (streaming and batch). Take eBay’s Real-time Personalization Platform as an example: every day we have to validate the data quality of ~600M records. Data quality often becomes a major challenge in such a complex environment and at such a massive scale.

We observed the following problems at eBay:

  1. Lack of an end-to-end, unified view of data quality from multiple data sources to target applications that takes into account the lineage of the data. This results in a long time to identify and fix data quality issues.
  2. Lack of a system to measure data quality in streaming mode through self-service. What is needed is a system where datasets can be registered, data quality models can be defined, data quality can be visualized and monitored using a simple tool, and teams can be alerted when an issue is detected.
  3. Lack of a shared platform and API service. No team should have to apply for and manage its own hardware and software infrastructure to solve this common problem.

With these in mind, we decided to build Apache Griffin, a data quality service that aims to address the above shortcomings.

Apache Griffin includes:

Data Quality Model Engine: Apache Griffin is a model-driven solution. Users can choose various data quality dimensions to execute their data quality validation based on a selected target data set or source data set (as the golden reference data). A corresponding back-end library supports the following measurements:

  • Accuracy - Does data reflect the real-world objects or a verifiable source
  • Completeness - Is all necessary data present
  • Validity - Are all data values within the data domains specified by the business
  • Timeliness - Is the data available at the time needed
  • Anomaly detection - Pre-built algorithm functions for the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset
  • Data Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
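As a rough illustration of the first two dimensions, the Python sketch below computes an accuracy ratio (the share of target records that agree with the golden source record carrying the same key) and a completeness ratio (the share of records in which every required field is present). This is not Griffin's implementation: the function names, the record layout, and the sample data are all assumptions made for illustration, and a real deployment would run such logic on Spark rather than on in-memory lists.

```python
from typing import Dict, Iterable, Sequence

# Hypothetical record layout: a plain dict of field name -> value.
Record = Dict[str, object]


def accuracy(source: Iterable[Record], target: Iterable[Record],
             key: str, fields: Sequence[str]) -> float:
    """Fraction of target records that match the source record with the same key."""
    # Index the golden reference data by key for constant-time lookups.
    source_by_key = {rec[key]: rec for rec in source}
    total = 0
    matched = 0
    for rec in target:
        total += 1
        ref = source_by_key.get(rec.get(key))
        if ref is not None and all(rec.get(f) == ref.get(f) for f in fields):
            matched += 1
    # An empty target has nothing wrong with it, so report 1.0 by convention.
    return matched / total if total else 1.0


def completeness(records: Iterable[Record], required: Sequence[str]) -> float:
    """Fraction of records in which every required field is present and not None."""
    total = 0
    complete = 0
    for rec in records:
        total += 1
        if all(rec.get(f) is not None for f in required):
            complete += 1
    return complete / total if total else 1.0


# Illustrative sample data (not taken from any real system).
source = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "b@example.com", "age": 41},
    {"id": 3, "email": "c@example.com", "age": 25},
]
target = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "b@example.com", "age": 40},   # age differs from source
    {"id": 3, "email": None, "age": 25},              # email is missing
    {"id": 4, "email": "d@example.com", "age": 52},   # no matching source record
]

print(accuracy(source, target, key="id", fields=["email", "age"]))  # 0.25
print(completeness(target, required=["email", "age"]))              # 0.75
```

Only one of the four target records agrees with the source on both compared fields, and three of the four have all required fields, which gives the two ratios shown.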

Data Collection Layer:

We support two kinds of data sources: batch data and real-time data.

For batch mode, we collect data from our Hadoop platform through various data connectors.

For real-time mode, we connect to a messaging system such as Kafka for near-real-time analysis.

Data Process and Storage Layer:

For batch analysis, our data quality model computes data quality metrics in our Spark cluster based on data sources in Hadoop.

For near-real-time analysis, we consume data from the messaging system, and our data quality model then computes real-time data quality metrics in our Spark cluster. For data storage, we use a time series database in our back end to serve front-end requests.
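The near-real-time flow can be sketched without a real messaging system or Spark cluster. The Python example below groups incoming events into fixed (tumbling) time windows and emits one `(window_start, valid_ratio)` point per window, which is the shape of data a time series store would hold. Everything here is an assumption for illustration: the event layout with a `ts` field in seconds, the 60-second window, the validity rule, and the in-memory list standing in for both the message stream and the time series database.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event layout: a dict with a "ts" field (epoch seconds) plus payload.
Event = Dict[str, object]


def windowed_valid_ratio(events: Iterable[Event], window_seconds: int,
                         is_valid: Callable[[Event], bool]) -> List[Tuple[int, float]]:
    """Group events into tumbling windows and return (window_start, valid_ratio) points."""
    counts = defaultdict(lambda: [0, 0])  # window_start -> [valid, total]
    for event in events:
        ts = int(event["ts"])
        # Align the timestamp to the start of its window.
        window_start = ts - ts % window_seconds
        bucket = counts[window_start]
        bucket[1] += 1
        if is_valid(event):
            bucket[0] += 1
    # Sort by window start so the result reads as a time series.
    return [(start, valid / total) for start, (valid, total) in sorted(counts.items())]


# Illustrative events standing in for messages consumed from a stream.
events = [
    {"ts": 0, "user_id": "u1"},
    {"ts": 20, "user_id": None},
    {"ts": 59, "user_id": "u2"},
    {"ts": 60, "user_id": "u3"},
    {"ts": 61, "user_id": None},
]

points = windowed_valid_ratio(events, 60, lambda e: e["user_id"] is not None)
for window_start, ratio in points:
    print(window_start, round(ratio, 3))
```

The first window (seconds 0 to 59) holds three events, two of them valid; the second holds two events, one of them valid. A front end could plot these points directly as a quality trend.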

Apache Griffin Service:

We have RESTful web services that expose all the functionality of Apache Griffin, such as registering data sets, creating data quality models, publishing metrics, retrieving metrics, and adding subscriptions. Developers can therefore build their own user interfaces on top of these web services.

Main business process

Architecture diagram

Tech stack


The challenge we face at eBay is that our data volume keeps growing and our system processes are becoming more complex, yet we lack a unified data quality solution that ensures trusted data sets and gives our data consumers confidence in data quality. The key data quality challenges include:

  1. Existing commercial data quality solutions cannot address data quality lineage among systems and cannot scale out to support fast-growing data at eBay.
  2. Existing domain-specific tools at eBay take a long time to identify and fix poor data quality when data flows through multiple systems.
  3. Business logic is becoming more complex, which requires a much more flexible data quality system.
  4. Some data quality issues have a business impact on user experience, revenue, efficiency, and compliance.
  5. Communicating data quality metrics carries overhead, typically in a big organization where different teams are involved.

The idea of Apache Griffin is to provide data quality validation as a service, giving data engineers and data consumers:

  • Near real-time understanding of the data quality health of your data pipelines with end-to-end monitoring, all in one place.
  • Profiling, detecting and correlating issues and providing recommendations that drive rapid and focused troubleshooting
  • A centralized data quality model management system including rule, metadata, scheduler etc.
  • Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
  • One set of tools to build data quality pipelines across all eBay data platforms.


Apache Griffin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.


Release Notes - Apache Griffin 0.1.6 (incubating)

  • Highlights
    • Streaming: measure streaming data quality based on defined measurements.
    • Support Griffin DSL and SQL to define data quality measurement.
    • Support multiple data connectors and data sources.
    • Fully support headless interaction through the RESTful API.
  • New Feature
    • [GRIFFIN-40] - Enhance DSL of Griffin, to support more types of measurement such as accuracy and profiling.
    • [GRIFFIN-6] - Onboard streaming model for accuracy, profiling.
  • Improvement
    • [GRIFFIN-26] - Support profiling measure process in measurement
    • [GRIFFIN-52] - Upgrade angularJS to angular2 for ui
  • Bug
    • [GRIFFIN-31] - Localedatestring is not valid for backend tracking
    • [GRIFFIN-38] - Fix bugs of job instance status in service
    • [GRIFFIN-48] - Fix measure deletion bug
    • [GRIFFIN-49] - Fix bug of cache for hive metastore data
    • [GRIFFIN-37] - Jobs, UI should update previous fire time and next fire time in real-time
    • [GRIFFIN-35] - Job instance state update problem
    • [GRIFFIN-33] - Target Partition form need validation as source partition
    • [GRIFFIN-34] - When create job, each input box should have a format check
  • Task
    • [GRIFFIN-66] - Upgrade our maven build system for angular 2 integration
    • [GRIFFIN-64] - Document for griffin dsl and samples

Release Notes - Apache Griffin 0.1.5 (incubating)

  • Highlights
    • Batch: measure data quality based on user defined measurements.
    • Standard process to define, measure and report data quality dimensions.
    • Dashboard to interact with griffin for whole data quality cycle.
  • New Feature
    • [GRIFFIN-11] - Enable data quality Accuracy measure in batch mode
    • [GRIFFIN-17] - Create a scheduler to schedule measure jobs
  • Improvement
    • [GRIFFIN-9] - Setup public live demo
    • [GRIFFIN-8] - New awesome griffin logo
  • Bug
    • [GRIFFIN-32] - Fix license header, by using SOURCE FILE HEADERS FOR CODE DEVELOPED AT THE ASF
    • [GRIFFIN-31] - localedatestring is not valid for backend tracking
    • [GRIFFIN-18] - The selection of hive data source can not get correct metadata from the tables in non-default database
    • [GRIFFIN-23] - Modify ‘models’ to ‘measures’, and ‘create dq model’ to ‘create dq measure’
    • [GRIFFIN-25] - Remove the portal of data assets registration from UI
    • [GRIFFIN-5] - Fix error in merge PR script
  • Task
    • [GRIFFIN-30] - Fix license issue reported by Justin.
    • [GRIFFIN-4] - Rename Griffin to Apache Griffin in documents
    • [GRIFFIN-2] - Setup griffin website on apache
    • [GRIFFIN-1] - Refactor service code to make it more open and extensible

Apache Griffin (incubating) - Downloads


How to contribute

Ask questions!

The Apache Griffin community is eager to help and to answer your questions. We have a user mailing list.


To subscribe to the dev list

To unsubscribe from the dev list

File a bug report

Please let us know if you experienced a problem with Griffin and file a bug report. Open Griffin’s JIRA and click on the blue Create button at the top. Please give detailed information about the problem you encountered and, if possible, add a description that helps to reproduce the problem.


Propose an improvement or a new feature

Our community is constantly looking for feedback to improve Apache Griffin. If you have an idea how to improve Griffin or have a new feature in mind that would be beneficial for Griffin users, please open an issue in Griffin’s JIRA. The improvement or new feature should be described in appropriate detail and include the scope and its requirements if possible.

We recommend first reaching consensus with the community on whether a new feature is required and how to implement it before starting the implementation.


Help others and join the discussions

Most communication in the Apache Griffin community happens on two mailing lists:

The user mailing list user@griffin.incubator.apache.org is the place where users of Apache Griffin ask questions and seek help or advice. Joining the user list and helping other users is a very good way to contribute to Griffin’s community.

The development mailing list dev@griffin.incubator.apache.org is the place where Griffin developers exchange ideas and discuss new features, upcoming releases, and the development process in general. If you are interested in contributing code to Griffin, you should join this mailing list.

You are welcome to subscribe to both mailing lists.

Contributing to Code

  • Create a JIRA ticket describing what you want to do

    create ticket here.
  • Create a new branch for this task

    # first fork this repo -- https://github.com/apache/incubator-griffin.git
    git clone https://github.com/{YOURNAME}/incubator-griffin.git
    cd incubator-griffin
    # create a working branch; the branch name here is only an example
    git checkout -b GRIFFIN-10
  • Commit and send a PR to us

    # please reference the related JIRA ticket in your commit message
    git commit -am "For task GRIFFIN-10 , blabla..."
    # push the branch to your fork, then open a pull request on GitHub
    git push origin GRIFFIN-10
  • The Griffin PPMC will review your PR and accept it as a contribution.

Contributing to Document

  • Contribute to source document

  • Contribute to griffin site

  • Contribute to griffin document

How to become a committer

Committers are community members that have write access to the project’s repositories, i.e., they can modify the code, documentation, and website by themselves and also accept other contributions.

There is no strict protocol for becoming a committer. Candidates for new committers are typically people that are active contributors and community members.

Being an active community member means participating on mailing list discussions, helping to answer questions, verifying release candidates, being respectful towards others, and following the meritocratic principles of community management. Since the “Apache Way” has a strong focus on the project community, this part is very important.

Of course, contributing code and documentation to the project is important as well. A good way to start is contributing improvements, new features, or bug fixes. You need to show that you take responsibility for the code that you contribute, add tests and documentation, and help maintain it.

Candidates for new committers are suggested by current committers or PMC members, and voted upon by the PMC.

If you would like to become a committer, you should engage with the community and start contributing to Apache Griffin in any of the above ways. You might also want to talk to other committers and ask for their advice and guidance.



Group      Component          Description
Measure    accuracy           accuracy measure between single source of truth and target
Measure    profiling          profiling target data asset, providing statistics by different rules or dimensions
Measure    completeness       whether all data is present
Measure    timeliness         whether data is available at the specified time
Measure    anomaly detection  whether the data asset conforms to an expected pattern
Measure    validity           whether all data is valid according to the business domain
Service    web service        RESTful service for accessing data assets
Web UI     ui page            web page to explore Apache Griffin features
Connector  spark connector    execute jobs in Spark cluster
Schedule   schedule           schedule measure jobs on different clusters
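To make the profiling and anomaly detection components listed above more concrete, here is a small Python sketch. It computes basic column statistics and flags values whose z-score exceeds a threshold. This is an illustration under assumptions, not Griffin's algorithm: the function names, the choice of statistics, the z-score rule, the threshold of 2.0, and the sample values are all hypothetical.

```python
import statistics
from typing import Dict, List, Optional, Sequence


def profile_column(values: Sequence[Optional[float]]) -> Dict[str, float]:
    """Basic profile of one column; assumes at least one non-null value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
    }


def find_anomalies(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Illustrative column with one missing value and one outlier.
prices = [10.0, 12.0, 11.0, None, 10.0, 11.0, 12.0, 10.0, 11.0, 95.0]

profile = profile_column(prices)
print(profile["count"], profile["null_count"], profile["distinct_count"])  # 10 1 4
print(find_anomalies([p for p in prices if p is not None]))                # [95.0]
```

A z-score rule is only one simple choice. It is sensitive to the outliers it is trying to find, so with very small samples no value can exceed a high threshold.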


2017.04 batch accuracy onboard

  • Week01: headless batch accuracy measure

    • headless batch accuracy measure use case onboard.
    • headless batch accuracy measure usage document.
  • Week02: batch accuracy measure with service

    • release batch accuracy measure with service enabled.
    • end2end headless workable use case, including guidance, metrics report.
    • prepare data in hive, explore data asset from ui, generate accuracy measure in ui, trigger accuracy measure in script.
  • Week03: batch accuracy measure with UI Page

    • UI Page refine: remove ‘create data asset’
    • end2end ui enabled workable use case.
    • prepare data in hive, explore data asset from ui, generate accuracy measure in ui, trigger accuracy measure in script.
  • Week04: release batch accuracy measure with UI, Service, Scheduler, Measure.

    • end to end full pipeline use case enabled.

2017.05 streaming accuracy P2

2017.06 streaming accuracy onboard P2

2017.07 schedule P4

2017.08 profiling P3

2017.09 completeness P2

2017.10 timeliness P2

2017.11 anomaly detection P3

2017.12 validity P3

Release Notes

2017.03.30 release streaming measures

Weekly updates

well planned and scalable

priority/epic/story/breakdown to backlog task.

3 measures