Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data. The platform addresses data discovery challenges such as a lack of understanding about what data exists and where to find it. Amundsen provides searchable metadata about data resources, data previews, and usage statistics to help data scientists and others explore and understand data.
6. Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Data Council attendance
• Where is the data?
• What does it mean?
7. Status quo
• Option 1: Phone a friend!
• Option 2: Github search
8. Understand the context
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
10. Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
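To make the contrast concrete, here is a minimal sketch (the table and column names are hypothetical) of the difference between probing a table with an unbounded SELECT * and answering the same "what is in here?" question with a bounded sample plus the catalog's own schema view:

```python
# Exploring a warehouse table without metadata: scans everything,
# burning the data scientist's time and the database's resources.
expensive_probe = "SELECT * FROM event_attendance"

# Cheaper equivalents: a bounded sample plus a schema lookup.
bounded_sample = "SELECT * FROM event_attendance LIMIT 100"
schema_lookup = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'event_attendance'
"""
```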
11. Data Scientists spend up to 1/3rd of their time in Data Discovery...
• Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
13. Data Discovery - User personas
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
14. 3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders
15. Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owner and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.
36. [Architecture diagram] Metadata sources (Postgres, Hive, Redshift, Presto, ..., Github source files) feed the Databuilder crawler, which populates a Neo4j graph and an Elasticsearch index. The Metadata Service fronts Neo4j and the Search Service fronts Elasticsearch; both serve the Frontend Service and other microservices such as the ML Feature Service and the Security Service.
37. 2. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services pushing / pulling metadata directly
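As a rough illustration, pushing or pulling metadata through such a REST proxy might look like the Python sketch below (the endpoint paths, port, and table key are hypothetical, not the service's actual API):

import requests

BASE = "http://metadata-service:5000"  # hypothetical host/port

# Pull: fetch metadata for a table by a (hypothetical) resource key
table = requests.get(f"{BASE}/table/hive://gold.core/event_ride").json()

# Push: update a table description directly through the proxy
requests.put(
    f"{BASE}/table/hive://gold.core/event_ride/description",
    json={"description": "Canonical ride events table"},
)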
46. [Architecture diagram repeated from slide 36]
47. 3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
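A minimal sketch of how these patterns could map onto Elasticsearch queries (index and field names are assumptions for illustration, not Amundsen's actual schema):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Normal search: relevancy match across common metadata fields
normal = es.search(index="table_index", body={
    "query": {"multi_match": {
        "query": "ride events",
        "fields": ["name^2", "description", "column_names"]}}})

# Category search: restrict to a category first (e.g. column), then rank by relevancy
category = es.search(index="table_index", body={
    "query": {"match": {"column_names": "is_line_ride"}}})

# Wildcard search, e.g. event_*
wildcard = es.search(index="table_index", body={
    "query": {"wildcard": {"name": "event_*"}}})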
49. How to make the search result more relevant?
• Define a search quality metric
‒ Click-Through-Rate (CTR) over the top 5 results
• Search behaviour instrumentation is key
• A couple of improvements:
‒ Boost exact table name matches in the ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
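A back-of-the-envelope sketch of the CTR-over-top-5 metric in Python (the log schema here is a made-up assumption for illustration):

# Hypothetical instrumentation: each search event records clicked result positions
searches = [
    {"query": "event_ride", "clicked_positions": [1]},
    {"query": "rides", "clicked_positions": []},           # no click
    {"query": "driver_events", "clicked_positions": [7]},  # click below the top 5
]

# CTR@5 = share of searches with at least one click within the top 5 results
ctr_at_5 = sum(
    1 for s in searches if any(p <= 5 for p in s["clicked_positions"])
) / len(searches)
print(f"CTR@5 = {ctr_at_5:.2f}")  # 0.33 for this toy log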
54. Metadata - Challenges
• No Standardization: no single data model fits all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: each data set’s metadata is stored and fetched differently
‒ Hive Table: stored in the Hive metastore
‒ RDBMS (Postgres etc.): fetched through the DBAPI interface
‒ Github source code: fetched through a git hook
‒ Mode dashboard: fetched through the Mode API
‒ …
56. Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● [Diagram: Scheduler triggers a Crawler, which pulls from the Database into the Data graph]
Push Model
● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to.
● [Diagram: the Database pushes to a Message queue, which feeds the Data graph]
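For the push side, a minimal Python sketch of a downstream subscriber (the topic name, payload schema, and ingest function are hypothetical assumptions):

import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer("metadata-events", bootstrap_servers="localhost:9092")

def upsert_into_data_graph(event: dict) -> None:
    # Placeholder for writing the metadata change into the graph backend
    print("ingesting metadata for", event.get("table"))

for msg in consumer:
    # Each message is a metadata change published by the source system
    upsert_into_data_graph(json.loads(msg.value))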
62. Amundsen seems to be more useful than we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce the open source release soon
63. Impact - Amundsen at Lyft
[Timeline: Alpha release → Beta release (internal) → Generally Available (GA) release]
64. Adding more kinds of data resources
Phase 1 (Complete): Data sets
Phase 2 (In development): Dashboards, People
Phase 3 (In Scoping): Streams, Schemas, Workflows
66. Summary
• Data Discovery adds 30+% more productivity to Data Scientists
• Metadata is key to the next wave of big data applications
• Amundsen - Lyft’s metadata and data discovery platform
• Blog post with more details: go.lyft.com/datadiscoveryblog
67. Jin Hyuk Chang | @jinhyukchang
Tao Feng | @feng-tao
Slides at go.lyft.com/amundsen_datacouncil_2019
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Today’s agenda:
Why empowering with data is important…
What are we doing in the data team at Lyft (context)...
What challenges we are facing and have seen other companies face…
How are we solving the problem...
At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
Who is our audience: everyone who works at Lyft…
Power users: Data Scientists, Research Scientists, Product Managers…
Next: Engineers, GMs, Ops, etc.
What does the architecture for our core infra look like?
Mobile application primarily…
Raw events come either from the client or from back-end events triggered on the server… the data flows to our message bus (Kinesis/Kafka) and then, with light ELT, is written to S3 where it persists… today we keep all the data in archival storage… then we develop data models and transform raw events into tables in Hive. We use Hive for long-running queries and Presto for interactive queries… People build dashboards on top of Hive and do exploratory visualization in Presto...
Data Discovery: How much of a challenge is it? Significant challenge…
Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis…
We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis…
But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth…
We can significantly increase productivity and impact if we can reduce this time...
Popularity is measured not by click-through rate but by query access patterns.
Amundsen architecture at Lyft:
3 microservices (FE, metadata, search) and one generic data ingestion framework
Will discuss each of the components in detail
High-level walkthrough… how CCPA compliance works
ML features (one sentence on what is feature service)
Add a logo of neo4j
? mention GitHub, HMS, and the Neo4j backend for descriptions; trade-off: the initial version propagated the metadata back, but we found that a full table rebuild doesn’t work in this case. ? what if a user modifies the GitHub source file
Why is a graph DB the best option? Why not an RDBMS, NoSQL, etc.?
Why choose Neo4j vs. other graph DBs? (most popular graph DB)
Amundsen needs to provide metadata including tables, columns, column statistics, usage information, etc. Along with that, Amundsen also needs to provide lineage information, where it needs to be able to represent producer and consumer relationships across the life cycle of data.
Lineage can be simplified as a graph of entities and edges. E.g. in the graph blahblah
There are other options: NoSQL (no join support), RDBMS (join performance is not good)
The above graph data model covers our use case: table and column metadata with usage down to the column level. Querying table detail from this Neo4j graph amounts to asking Neo4j to find the table node as a starting point and traverse from it. In other words, only one search operation is needed to find the anchor node, which makes Neo4j performant -- no join operations at all.
We model the graph with bi-directional relationships.
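A minimal sketch of that anchor-node traversal in Cypher via the Python driver (node labels and relationship types are illustrative assumptions, not Amundsen's actual schema):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One index lookup finds the anchor Table node; everything else is traversal, no joins.
query = """
MATCH (t:Table {name: $name})
OPTIONAL MATCH (t)-[:COLUMN]->(c:Column)-[:STAT]->(s:Stat)
RETURN t, collect(c) AS columns, collect(s) AS stats
"""

with driver.session() as session:
    record = session.run(query, name="event_ride").single()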
? challenge
How to improve search relevance?
Think of the search algorithm…
Event_ride_table -> event ride table (tokenization)
Talk about how we measure it: instrument empty results from users, % click-through rate
What we did: boost the rank of exact table name matches
Walk through pros and cons
The pull approach basically extracts data from the source periodically. Amundsen Databuilder is responsible for extraction, transformation, and load. This naturally gives us three abstracted constructs: Extractor, Transformer, and Loader, plus an optional Publisher. The design principle follows Apache Gobblin.
The Extractor extracts records from the source one record at a time. For example, in Hive we would need a column metadata extractor for a table, where each record represents one column of the table.
The Transformer transforms a record. Any case where we need to transform (e.g. remove special characters) or decorate the record (e.g. make a service call to enrich the data) goes here.
The Loader writes data either into the sink (destination) or into a staging area.
The Publisher assumes the Loader loaded into a staging area and publishes it to the destination. Atomicity is desired behavior, but it depends on the sink’s own support for atomicity.
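A bare-bones sketch of these constructs in Python (simplified class shapes for illustration, not Databuilder's exact interfaces):

from typing import Iterator

class Extractor:
    """Yields records from a source, one at a time."""
    def extract(self) -> Iterator[dict]:
        yield {"table": "event_ride", "column": "is_line_ride "}

class Transformer:
    """Transforms or decorates a single record."""
    def transform(self, record: dict) -> dict:
        record["column"] = record["column"].strip()  # e.g. remove stray characters
        return record

class Loader:
    """Writes records into the sink or a staging area."""
    def load(self, record: dict) -> None:
        print("staged:", record)

class Publisher:
    """Publishes staged records to the destination, atomically if the sink allows."""
    def publish(self) -> None:
        print("published staging area to destination")

# Wire the constructs together, Gobblin-style
extractor, transformer, loader, publisher = Extractor(), Transformer(), Loader(), Publisher()
for rec in extractor.extract():
    loader.load(transformer.transform(rec))
publisher.publish()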
A slide on Amundsen @ Lyft?
? how long it has been in prod
How many datasets
Users
WAU
Usage
Kafka topic
Schema registry
ML workflow and Airflow DAGs
*** Feel free to edit based on the season; no need to divide by location, can also divide by department.