SlideShare a Scribd company logo
1 of 36
END-TO-END DATA GOVERNANCE
WITH APACHE AVRO AND ATLAS
Barbara Eckman, Ph.D.
Principal Data Architect
Comcast
Mission
Gather, organize, make sense of Comcast data,
and make it universally accessible through
Platforms, Solutions, Products.
• Dozens of tenants and stakeholders
• Millions of messages/second captured
• Tens of PB of long term data storage
• Thousands of cores of distributed compute
Motivating example
• Record representing an IP Video player download session.
– Time download session created
– List of timestamps when video download started
– Endtime, status (failure, success)
– Device info
– Customer Account info
– Asset info (ie movie, tv show, etc being downloaded)
– Fragments: total, completed
– Failure events
• Suppose I want to integrate this data with :
– Customer experience data -- join on customer account
– IP player analytics session (latency, buffering) -- join on device, asset id
– Comcast network traffic data -- join on device
Data integration shop of horrors!
• Customer account ids:
– Xcal
– Xbo
– Billing
– “Service”
• Device ids:
– Physical
– DeviceId
– Xcal
– Xbo
• Asset ids:
– ProviderId, assetId
– streamId
– recordingId
– EAS_URI
– mediaGuid
– mediaId
– assetContentId
– programId
– platformId
Data and Schema
Governance to the Rescue!
Questions: Schema Governance
• I want to join your data with my data—but
what does your data mean???
• Is it safe to join my data with yours (Avoid
“Frankendata”)?
• What attributes in your data match attributes
in mine? (ie potential join fields)
• Can I change my schema without breaking
systems of others that rely on it?
Questions: Data Governance
• Where can I find data about X?
– How is this data structured?
– Who produced it?
• How has the data changed in its journey
from ingest to where I’m viewing it?
• Where are the derivatives of my original
data to be found?
Outline
• Avro for Schema Governance
– What is Apache Avro
– How Comcast BPs make Avro even stronger
• Atlas for Data Governance
– Metadata Browser and Schema registry
– Comcast extensions to Atlas
– Platforms and Atlas work together for lineage
Apache Avro
• A data serialization system
– A JSON-based schema language
– A compact serialized format
• APIs in a bunch of languages
• Benefits:
– Cross-language support for dynamic data access
– Simple but expressive schema definition and
evolution
Avro Types
• primitive
– string
– bytes
– int & long
– float & double
– boolean
– Null (enables
optional fields)
• complex
– record
– array
– map: string -> Type
– union (aka “choice”)
– fixed<N> (byte arrays,
eg uuid)
– enum:
[“larry”|“moe”|“curly”]
Simple Avro Example—Generic
Application Log Event
{ "name": "loggerName",
"type": "string“,
"default": “unknown”,
"doc": "The name of the logger class that
generated the message.”},
{ "name": "message",
"type": ["null", "string“],
"default": null,
"doc": "The formatted message of the log event."}
I want to change my schema…will it
mess others up??
• AVRO serialized data is never used without its
schema.
• We have empirically determined a complete
set of rules for schema evolution and
backward compatibility, for example:
– You may add an optional field to your schema
– You may not change the type of any field
I want to change my schema…will it
mess others up??
• Some chains of compatible changes produce
incompatible results
– remove optional string field X
– add new optional int field X
– This sequence has changed the type of field X
from string to int!
• We can use this data in reasoning about
schema evolution, e.g. “is this schema
compatible with the schema 3 versions back?”
Avro Schema Creation Best Practices
• Each schema is checked for compliance with
Comcast conventions on compile
• Each schema is reviewed and approved by at
least one human being
• Comcast conventions:
– doc comments required to document every
attribute
– All attributes must have default values
– Unnecessary complexity is discouraged (YAGNI
principle)
Avro Schema Creation Best Practices
Data governance policy on updates:
– Data must always match a schema in the schema
registry or be traceable to such a schema
– Updates to schemas of data “in flight” or “at rest”
are not permitted, though re-publication of
enriched data is permitted.
– Schemas in the registry may be evolved at will, but
non-compatible changes should be kept to a bare
minimum
AND MOST IMPORTANTLY…
Philosophy of modular core
subschema reuse
• Github repo of core subschemas
– application (running on a device)
– customerAccount
– device
– error
– geolocation
– header (timestamp, uuid, hostname)
– logEvent
– moneyTrace (ie distributed message tracing)
– monitoringEvent
– networkInterface
– more added as needed
Recap: Avro benefits
• Benefits to the Business
– Data integration to answer business questions
• Benefits to data and schema governance
– Data standardization
– Data integrity
– Standardized documentation
– Data lineage
– Clear compatibility criteria
Avro in the context of data governance:
Apache Atlas
User-Facing Solutions
and Products
DEAP
Analytics
PERPlatforms
Data Engineering Analytics Platform
Architecture
Athene noctua, “little owl”
Companion of Athena, Greek goddess of Wisdom,
Justice, and Good Governance.
• Where can I find data about
X?
• How has the data changed
in its journey from ingest to
where I’m viewing it?
• Where are the derivatives
of my original data to be
found?
• Can I control who
sees/changes my data?
(esp. PII)
Questions: Data Governance
Data Discovery
Data Lineage
Data Security
Apache Atlas
• Data Discovery, Lineages
– Browser UI
– Rest and Java APIs
– Synchronous and Asynchonous messaging
• Integrated Security (Apache Ranger)
• Schema Registry as well as Metadata Repo
Open Source
Extensible
Atlas Extensions for Schema Registry
• New typedefs: avro_schema, avro_record,
avro_enum, etc
• Extensions to Kafka topic type
– sizing parameters
• Reciprocal links between topics and schemas
• Schema evolution
– Versioning
– schema lineage process
– B/F/Full compatibility
• Better handling of JSON-valued attributes in UI
Avro Schema Registry for Data at
Ingest
• Browsable hierarchical schemas
• Avro schemas corresponding to kafka topics
(or no kafka topics)
• Schema lineage using Atlas processes
• Schema evolution and compatibility
• Kafka topic search capabilities
• Contact person for each kafka topic/schema
• API for fully-expanded avro schemas for serde
Data Lineage for Data at Rest
• Compressed avro binary files in long-term object
storage
• Hive or parquet tables derived directly from avro
data
• Data derived/aggregated from hive tables
• Avro-serialized JSON data in NoSQL
• etc
For any data item in any community data store, we
must be able to identify the corresponding avro
schema in the registry
Atlas Extensions for Data Lineage
• S3 objects (also linked to schemas)
• Hive tables and avro schemas linked
• Link to slack help channel in UI
• Java library to facilitate creation and linking of
entities, using asynchronous messaging
– create S3pseudo, link it to avro
– Update kafka sizing
batch data
with
avro schemas
ATHENE: schemas, metadata, lineage
HEADWATERS:
stream data
with
avro schemas
CLOUDBRIDGE:
transformation
creates
Stream data
Parquet file;
Lineage links
Hive table;
Lineage links
OBJECT
STORAGE
Platforms work together for data
governance
S3 pseudodir;
Lineage links
Demo here
Visualizing Data Lineage
What’s Next
• More types
– Independent kafka2S3 service that messages Atlas
– More transforms as first-class processes
• Generic transform library with avro schemas expressing
inputs and outputs
– Zeppelin and other notebooks
– Other data sources in our hybrid environment
• Circonus checks
Summary
• Avro for Schema Governance
– Our lingua franca for end-to-end metadata
– Comcast BPs make Avro even stronger
• Atlas for Data Governance
– Metadata Browser and Schema registry
– Comcast typedef extensions to Atlas
• Avro schemas, records, enums, maps, arrays, etc
• Add attributes to already existing types (eg kafka topics)
• Data sources outside Hadoop ecosystem
• Transforms, schema versioning, lineage, compatibility
– Platforms and Atlas work together for lineage
Challenges and Solutions
Challenge Solution
Creating conformant avro schemas is not trivial Detailed documentation, sample code in Java,
Python, C#, GO, etc; team of reviewers
Avro schemas are annoying to create in a text
editor
Avro schema builder UI—in beta now
Avro manual schema review process was
originally too slow
Trained more reviewers, streamlined processes;
Avro schema reviewer UI—in beta now
Not everyone can publish data in Avro, but they
can in JSON
Use Apache NiFi to convert from JSON to JSON-
serialized avro
No one likes to be “governed”;
Some developers saw creating Avro schemas as a
drag on their productivity
Top-down mandate; came to recognize the
benefits of a schema in data handling:
excitement from analysts, end-user applications
Finding data governance tooling to handle
metadata and lineage for our hybrid environment
Apache Atlas: Open source, extensible solution!
Need a schema registry that allows browsing,
highlights our BPs, also serves up avro for serde
Atlas: Create avro_schema type, extend kafka
topic type, represent schema evolution
Informing Atlas of changes where there aren’t
built-in “hooks”
Extensibility!! Java library hides Atlas-specific
syntax, makes it easy for other platforms
My collaborators
Vadim Vaks
Sr. Solutions Architect
Hortonworks
Thank You!
Barbara_Eckman@Cable.Comcast.com

More Related Content

What's hot

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
 
Migrating with Debezium
Migrating with DebeziumMigrating with Debezium
Migrating with DebeziumMike Fowler
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...confluent
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLDatabricks
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteAmr Awadallah
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practicesconfluent
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 

What's hot (20)

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
Migrating with Debezium
Migrating with DebeziumMigrating with Debezium
Migrating with Debezium
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 

Similar to End-to-end Data Governance with Apache Avro and Atlas

An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...DataWorks Summit
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsCarole Goble
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...HostedbyConfluent
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEPBIOVIA
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...HostedbyConfluent
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWSTom Laszewski
 
Using AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceUsing AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceChristian Beedgen
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaGuido Schmutz
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 

Similar to End-to-end Data Governance with Apache Avro and Atlas (20)

An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP
(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWS
 
Using AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceUsing AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics Service
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

End-to-end Data Governance with Apache Avro and Atlas

  • 1. END-TO-END DATA GOVERNANCE WITH APACHE AVRO AND ATLAS Barbara Eckman, Ph.D. Principal Data Architect Comcast
  • 2. Mission Gather, organize, make sense of Comcast data, and make it universally accessible through Platforms, Solutions, Products. • Dozens of tenants and stakeholders • Millions of messages/second captured • Tens of PB of long term data storage • Thousands of cores of distributed compute
  • 3. Motivating example • Record representing an IP Video player download session. – Time download session created – List of timestamps when video download started – Endtime, status (failure, success) – Device info – Customer Account info – Asset info (ie movie, tv show, etc being downloaded) – Fragments: total, completed – Failure events • Suppose I want to integrate this data with : – Customer experience data -- join on customer account – IP player analytics session (latency, buffering) -- join on device, asset id – Comcast network traffic data -- join on device
  • 4. Data integration shop of horrors! • Customer account ids: – Xcal – Xbo – Billing – “Service” • Device ids: – Physical – DeviceId – Xcal – Xbo • Asset ids: – ProviderId, assetId – streamId – recordingId – EAS_URI – mediaGuid – mediaId – assetContentId – programId – platformId
  • 5. Data and Schema Governance to the Rescue!
  • 6. Questions: Schema Governance • I want to join your data with my data—but what does your data mean??? • Is it safe to join my data with yours (Avoid “Frankendata”)? • What attributes in your data match attributes in mine? (ie potential join fields) • Can I change my schema without breaking systems of others that rely on it?
  • 7. Questions: Data Governance • Where can I find data about X? – How is this data structured? – Who produced it? • How has the data changed in its journey from ingest to where I’m viewing it? • Where are the derivatives of my original data to be found?
  • 8. Outline • Avro for Schema Governance – What is Apache Avro – How Comcast BPs make Avro even stronger • Atlas for Data Governance – Metadata Browser and Schema registry – Comcast extensions to Atlas – Platforms and Atlas work together for lineage
  • 9. Apache Avro • A data serialization system – A JSON-based schema language – A compact serialized format • APIs in a bunch of languages • Benefits: – Cross-language support for dynamic data access – Simple but expressive schema definition and evolution
  • 10. Avro Types • primitive – string – bytes – int & long – float & double – boolean – Null (enables optional fields) • complex – record – array – map: string -> Type – union (aka “choice”) – fixed<N> (byte arrays, eg uuid) – enum: [“larry”|“moe”|“curly”]
  • 11. Simple Avro Example—Generic Application Log Event { "name": "loggerName", "type": "string“, "default": “unknown”, "doc": "The name of the logger class that generated the message.”}, { "name": "message", "type": ["null", "string“], "default": null, "doc": "The formatted message of the log event."}
  • 12. I want to change my schema…will it mess others up?? • AVRO serialized data is never used without its schema. • We have empirically determined a complete set of rules for schema evolution and backward compatibility, for example: – You may add an optional field to your schema – You may not change the type of any field
  • 13. I want to change my schema…will it mess others up?? • Some chains of compatible changes produce incompatible results – remove optional string field X – add new optional int field X – This sequence has changed the type of field X from string to int! • We can use this data in reasoning about schema evolution, e.g. “is this schema compatible with the schema 3 versions back?”
  • 14. Avro Schema Creation Best Practices • Each schema is checked for compliance with Comcast conventions on compile • Each schema is reviewed and approved by at least one human being • Comcast conventions: – doc comments required to document every attribute – All attributes must have default values – Unnecessary complexity is discouraged (YAGNI principle)
  • 15. Avro Schema Creation Best Practices Data governance policy on updates: – Data must always match a schema in the schema registry or be traceable to such a schema – Updates to schemas of data “in flight” or “at rest” are not permitted, though re-publication of enriched data is permitted. – Schemas in the registry may be evolved at will, but non-compatible changes should be kept to a bare minimum AND MOST IMPORTANTLY…
  • 16. Philosophy of modular core subschema reuse • Github repo of core subschemas – application (running on a device) – customerAccount – device – error – geolocation – header (timestamp, uuid, hostname) – logEvent – moneyTrace (ie distributed message tracing) – monitoringEvent – networkInterface – more added as needed
  • 17. Recap: Avro benefits • Benefits to the Business – Data integration to answer business questions • Benefits to data and schema governance – Data standardization – Data integrity – Standardized documentation – Data lineage – Clear compatibility criteria
  • 18. Avro in the context of data governance: Apache Atlas
  • 21. Athene noctua, “little owl” Companion of Athena, Greek goddess of Wisdom, Justice, and Good Governance.
  • 22. • Where can I find data about X? • How has the data changed in its journey from ingest to where I’m viewing it? • Where are the derivatives of my original data to be found? • Can I control who sees/changes my data? (esp. PII) Questions: Data Governance Data Discovery Data Lineage Data Security
  • 23. Apache Atlas • Data Discovery, Lineages – Browser UI – Rest and Java APIs – Synchronous and Asynchonous messaging • Integrated Security (Apache Ranger) • Schema Registry as well as Metadata Repo Open Source Extensible
  • 24. Atlas Extensions for Schema Registry • New typedefs: avro_schema, avro_record, avro_enum, etc • Extensions to Kafka topic type – sizing parameters • Reciprocal links between topics and schemas • Schema evolution – Versioning – schema lineage process – B/F/Full compatibility • Better handling of JSON-valued attributes in UI
  • 25. Avro Schema Registry for Data at Ingest • Browsable hierarchical schemas • Avro schemas corresponding to kafka topics (or no kafka topics) • Schema lineage using Atlas processes • Schema evolution and compatibility • Kafka topic search capabilities • Contact person for each kafka topic/schema • API for fully-expanded avro schemas for serde
  • 26. Data Lineage for Data at Rest • Compressed avro binary files in long-term object storage • Hive or parquet tables derived directly from avro data • Data derived/aggregated from hive tables • Avro-serialized JSON data in NoSQL • etc For any data item in any community data store, we must be able to identify the corresponding avro schema in the registry
  • 27. Atlas Extensions for Data Lineage • S3 objects (also linked to schemas) • Hive tables and avro schemas linked • Link to slack help channel in UI • Java library to facilitate creation and linking of entities, using asynchronous messaging – create S3pseudo, link it to avro – Update kafka sizing
  • 28. batch data with avro schemas ATHENE: schemas, metadata, lineage HEADWATERS: stream data with avro schemas CLOUDBRIDGE: transformation creates Stream data Parquet file; Lineage links Hive table; Lineage links OBJECT STORAGE Platforms work together for data governance S3 pseudodir; Lineage links
  • 30.
  • 32. What’s Next • More types – Independent kafka2S3 service that messages Atlas – More transforms as first-class processes • Generic transform library with avro schemas expressing inputs and outputs – Zeppelin and other notebooks – Other data sources in our hybrid environment • Circonus checks
  • 33. Summary • Avro for Schema Governance – Our lingua franca for end-to-end metadata – Comcast BPs make Avro even stronger • Atlas for Data Governance – Metadata Browser and Schema registry – Comcast typedef extensions to Atlas • Avro schemas, records, enums, maps, arrays, etc • Add attributes to already existing types (eg kafka topics) • Data sources outside Hadoop ecosystem • Transforms, schema versioning, lineage, compatibility – Platforms and Atlas work together for lineage
  • 34. Challenges and Solutions Challenge Solution Creating conformant avro schemas is not trivial Detailed documentation, sample code in Java, Python, C#, GO, etc; team of reviewers Avro schemas are annoying to create in a text editor Avro schema builder UI—in beta now Avro manual schema review process was originally too slow Trained more reviewers, streamlined processes; Avro schema reviewer UI—in beta now Not everyone can publish data in Avro, but they can in JSON Use Apache NiFi to convert from JSON to JSON- serialized avro No one likes to be “governed”; Some developers saw creating Avro schemas as a drag on their productivity Top-down mandate; came to recognize the benefits of a schema in data handling: excitement from analysts, end-user applications Finding data governance tooling to handle metadata and lineage for our hybrid environment Apache Atlas: Open source, extensible solution! Need a schema registry that allows browsing, highlights our BPs, also serves up avro for serde Atlas: Create avro_schema type, extend kafka topic type, represent schema evolution Informing Atlas of changes where there aren’t built-in “hooks” Extensibility!! Java library hides Atlas-specific syntax, makes it easy for other platforms
  • 35. My collaborators Vadim Vaks Sr. Solutions Architect Hortonworks

Editor's Notes

  1. 10s of Millions of Customers 100s of Millions of Devices
  2. I want to join your data - Most of the time in analytics is spent gathering, integrating, understanding data (“futzing with data”). Is it safe to join my data with yours (Avoid “Frankendata”)? -is my device_id the same as your “device_id”? -Without common schemas, data integration is nearly impossible Can I change my schema without breaking systems of others that rely on it? -Schemas can change in unpredictable and non-backward compliant ways
  3. Serialized in either binary or JSON APIs in: Java, Python, Go, C, C++, C#, …