This document summarizes a presentation on architecting data platforms given at the Strata Data Conference in London 2018. The presentation discusses building a customer 360 view using streaming vehicle and other IoT data. It outlines the requirements to support real-time querying, batch processing, and analytics. The high-level architecture shown includes data sources, streaming pipelines, storage systems, and processing engines. Key challenges discussed are reliably ingesting multiple data types and scaling to support various workloads and access patterns.
Architecting a Next Generation Data Platform – Strata Singapore 2017 (Jonathan Seidman)
This document discusses the high-level architecture for a data platform to support a customer 360 view using data from connected vehicles (taxis). The architecture includes data sources, streaming data ingestion using Kafka, schema validation, stream processing for transformations and routing, and storage for analytics, search and long-term retention. The presentation covers design considerations for reliability, scalability and processing of both streaming and batch data to meet requirements like querying, visualization, and batch processing of historical data.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Architecting a next-generation data platform (hadooparchbook)
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
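For a rough flavor of the Kafka ingestion step described above, here is a minimal sketch (not from the slides) that publishes a taxi trip event with the kafka-python client; the broker address, topic name, and event fields are invented for illustration.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address and topic name, for illustration only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

# A made-up taxi trip event; real schemas would be defined per source.
trip_event = {
    "vehicle_id": "taxi-1234",
    "timestamp": "2018-05-22T10:15:30Z",
    "lat": 51.5074,
    "lon": -0.1278,
    "speed_kmh": 32.5,
}

# Keying by vehicle_id keeps events for one vehicle in a single partition,
# which preserves per-vehicle ordering for downstream stream processors.
producer.send("taxi-trips", key=b"taxi-1234", value=trip_event)
producer.flush()
```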
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data! Much of it consists of business transactions stored in a relational database. Increasingly, though, the data is non-structured, high volume, and rapidly changing - the datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform this data into useful information. Not just that, but it must also be done in near real-time and delivered to a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle's premier data replication technology, and Apache Kafka, the leading open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
We produce quite a lot of data. Some of this data comes in the form of business transactions and is stored in a relational database. This relational data is often combined with other non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for us as data integration professionals is to then combine this data and transform it into something useful. Not just that, but we must also do it in near real-time and using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides us a great solution for that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution. This session will walk through the implementation of GoldenGate and Kafka.
Presented at Collaborate16 in Las Vegas.
Top 5 mistakes when writing Streaming applications (hadooparchbook)
This document discusses five common mistakes when writing streaming applications and provides solutions for each: 1) not shutting down apps gracefully - use shutdown hooks or external markers to stop processing only after in-flight batches finish; 2) assuming exactly-once semantics - failures can occur at multiple points, so track offsets and make operations idempotent; 3) using streaming for everything - batch processing remains the better fit for some goals; 4) not preventing data loss - enable checkpointing and write-ahead logs; 5) not monitoring jobs - use tools like the Spark Streaming UI and Graphite, and run in YARN cluster mode for automatic restarts.
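As a minimal sketch of two of those fixes - checkpoint-based recovery with the receiver write-ahead log enabled, plus a graceful stop - the following PySpark snippet is illustrative only; the checkpoint directory and socket source are placeholders, not anything prescribed by the slides.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # placeholder path

def create_context():
    # Enable the receiver write-ahead log so received data survives driver failure.
    conf = SparkConf().set("spark.streaming.receiver.writeAheadLog.enable", "true")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint(CHECKPOINT_DIR)

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    lines.count().pprint()
    return ssc

# Recover from the checkpoint if one exists, otherwise build a fresh context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
try:
    ssc.awaitTermination()
finally:
    # Graceful stop: finish processing data already received before exiting.
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
```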
What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques such as tracking offsets and making writes idempotent, since failures can occur at multiple points; one common pattern is sketched below.
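Continuing point 4, a common pattern (sketched below; not taken from the slides) is at-least-once delivery plus idempotent writes: commit the Kafka offset only after the record has been durably written, and key the write by a unique event id so reprocessing is harmless. The topic, group id, and sink here are stand-ins.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder names; any durable store with upsert semantics would work as a sink.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    enable_auto_commit=False,          # we commit offsets ourselves
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

seen = {}  # stand-in for an idempotent sink (e.g. keyed upserts in a database)

for message in consumer:
    event = message.value
    # Idempotent write: re-processing the same event id simply overwrites the row.
    seen[event["event_id"]] = event
    # Commit only after the write succeeded; a crash before this line means the
    # event is re-delivered and harmlessly re-applied.
    consumer.commit()
```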
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
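To make the processed-data layout concrete, the following PySpark sketch writes date-partitioned Parquet; the paths, columns, and partition keys are illustrative, and reading Avro assumes Spark 2.4+'s built-in avro format (or the spark-avro package on older versions).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-example").getOrCreate()

# Read raw events, assumed here to be Avro files landed by the ingest pipeline.
raw = spark.read.format("avro").load("hdfs:///data/clickstream/raw/")

# Illustrative columns only; a real pipeline would add sessionization, cleansing, etc.
processed = raw.select("user_id", "url", "event_time", "year", "month", "day")

# Columnar Parquet, partitioned by date, keeps analytical scans cheap and lets
# queries prune partitions via the HDFS directory layout (year=.../month=.../day=...).
(processed.write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("hdfs:///data/clickstream/processed/"))
```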
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming (Michael Rainey)
This document provides an overview and summary of a presentation on integrating Oracle GoldenGate and Apache Kafka for real-time data streaming. It introduces the speaker, describes Rittman Mead as a specialist in Oracle data integration and analytics, and outlines the challenges of integrating new data sources. The bulk of the document then dives into a step-by-step example of using GoldenGate to replicate transactional data from an Oracle database to Kafka in real-time via Kafka's publish-subscribe capabilities.
Flink in Zalando's world of Microservices (ZalandoHayley)
Apache Flink Meetup at Zalando Technology, May 2016
By Javier Lopez & Mihail Vieru, Zalando
In this talk we present Zalando's microservices architecture and introduce Saiki – our next generation data integration and distribution platform on AWS. We show why we chose Apache Flink to serve as our stream processing framework and describe how we employ it for our current use cases: business process monitoring and continuous ETL. We then have an outlook on future use cases.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Hadoop Application Architectures tutorial at Big DataService 2015 (hadooparchbook)
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Big Data, Data Lake, Fast Data - Data Serialization Formats (Guido Schmutz)
The concept of the "Data Lake" is on everyone's mind today. The idea of storing all the data that accumulates in a company in a central location and making it available sounds very appealing at first. But a Data Lake can quickly turn from a clear, beautiful mountain lake into a murky pond, especially if it is carelessly filled with every source data format common in today's enterprises, such as XML, JSON, CSV or unstructured text. After a while, who still has an overview of which data exists, in which format, and how it has evolved across versions? Anyone who wants to draw from the Data Lake must ask the same questions over and over again: what information is provided, what data types does it use, and how has the content changed over time?
Data serialization frameworks such as Apache Avro and Google Protocol Buffers (Protobuf), which enable platform-independent data modeling and data storage, can help. This talk discusses the possibilities of Avro and Protobuf, shows how they can be used in the context of a data lake, and explains what advantages can be achieved. Support for Avro and Protobuf in Big Data and Fast Data platforms is also covered.
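To make the Avro portion concrete, here is a small, hypothetical schema and a write/read round trip using the fastavro library; the record and field names are invented for illustration.

```python
from fastavro import writer, reader, parse_schema  # pip install fastavro

# A hypothetical order event schema; evolving it later (e.g. adding a field with
# a default) keeps old files readable, which is the point of schema-on-write.
schema = parse_schema({
    "name": "OrderEvent",
    "type": "record",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "EUR"},
    ],
})

records = [{"order_id": "o-1", "amount": 19.99, "currency": "CHF"}]

# Write a container file with the schema embedded, then read it back.
with open("orders.avro", "wb") as out:
    writer(out, schema, records)

with open("orders.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)
```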
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
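For the query side, here is a small illustrative sketch of talking to a Solr collection from Python with the pysolr client; the collection URL, documents, and fields are placeholders, and a secured Cloudera Search deployment would additionally involve Kerberos and Sentry as noted above.

```python
import pysolr  # pip install pysolr

# Placeholder collection URL.
solr = pysolr.Solr("http://localhost:8983/solr/clickstream", timeout=10)

# Index a couple of documents (in the architecture above this would normally be
# done near-real-time via Flume/Morphlines or in batch via MapReduce).
solr.add([
    {"id": "doc-1", "url": "/checkout", "status": 200},
    {"id": "doc-2", "url": "/search", "status": 404},
], commit=True)

# Interactive query: all 404 responses (field names are invented).
for doc in solr.search("status:404", rows=10):
    print(doc["id"], doc["url"])
```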
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka (Guido Schmutz)
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, ranging from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years, new tools have emerged that are especially capable of handling the process of integrating data from outside, often called data ingestion. From the outside they look very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, can handle high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session presents and compares Apache Flume, Apache NiFi, StreamSets, and the Kafka ecosystem, and shows how they handle data ingestion in a Big Data solution architecture.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change, with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... (Data Con LA)
NoSQL has exploded onto the developer scene, promising alternatives to RDBMSs that make rapidly developing Internet-scale applications easier than ever. However, as a trade-off for the ease of development and scale, some of the familiarity of well-known query interfaces such as SQL has been lost. Until now, that is: N1QL (pronounced ‘nickel’) is a SQL-like query language for querying JSON, which brings the familiarity of RDBMSs back to the NoSQL world. In this session you will learn about the syntax and basics of this new language as well as its integration with the Couchbase SDKs.
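To give a feel for the language, here is a hypothetical N1QL statement over JSON booking documents, wrapped in a Python string for consistency with the other sketches; the bucket and field names are invented, and each Couchbase SDK exposes a query method for executing statements like this.

```python
# Hypothetical N1QL over JSON documents: SQL-style aggregation against a
# document bucket (bucket and field names are invented for illustration).
N1QL_QUERY = """
SELECT b.destination, COUNT(*) AS bookings, AVG(b.fare) AS avg_fare
FROM `travel-bookings` AS b
WHERE b.type = "booking"
  AND b.booked_at >= "2015-01-01"
GROUP BY b.destination
ORDER BY bookings DESC
LIMIT 10
"""

print(N1QL_QUERY)
```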
This document discusses interactive data visualization powered by Spark streaming. It describes how Zoomdata allows users to visualize streaming data in real-time as new data is delivered. The key challenges of streaming data like time, frequency, retention and synchronization are addressed. Zoomdata receives streaming data via Kafka or JMS, processes it using Spark Streaming in a single JVM, and stores the data in buffers like MongoDB. This allows for interactive data visualizations that update in real-time as new streaming data is processed. The document also outlines technologies used, how the system scales out, benefits, and includes a demo of streaming data from Twitter to MemSQL and Solr sinks using Spark Streaming.
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors (Michael Rainey)
Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.
Presented at the Oracle Technology Network Virtual Technology Summit February/March 2015.
Architecting a Next Gen Data Platform – Strata New York 2018 (Jonathan Seidman)
Using Customer 360 and the internet of things as examples, this tutorial explains how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
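For a flavor of how some of those components fit together, the sketch below uses Spark Structured Streaming to read vehicle events from Kafka and compute windowed averages; the topic, brokers, schema, and sink are hypothetical, and it assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("vehicle-stream-example").getOrCreate()

# Hypothetical vehicle event schema.
schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "vehicle-events")        # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# Average speed per vehicle over 5-minute event-time windows.
avg_speed = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "vehicle_id")
    .avg("speed_kmh"))

query = (avg_speed.writeStream
    .outputMode("update")
    .format("console")    # a real pipeline might write to Kudu or a warehouse instead
    .start())
query.awaitTermination()
```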
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks (Databricks)
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software
3 Things to Learn:
-How data is driving digital transformation to help businesses innovate rapidly
-How Choice Hotels (one of the largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business
-How Choice Hotels has transformed business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ... (Databricks)
One of the biggest challenges customers face is how to productionize machine learning for the enterprise. Once data scientists, data engineers, business analysts, and machine learning engineers have successfully built their machine learning models, they need model management: a system that manages and orchestrates the entire lifecycle of machine learning models.
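As a small, hedged illustration of that lifecycle idea, the snippet below logs parameters, metrics, and a scikit-learn model to MLflow tracking; the experiment name, model, and metric are placeholders rather than anything specific to this talk.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demo-experiment")  # placeholder experiment name

with mlflow.start_run():
    model = LogisticRegression(C=0.5, max_iter=200).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Everything needed to reproduce and later serve the model is recorded here.
    mlflow.log_param("C", 0.5)
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
```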
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ... (Deepak Chandramouli)
PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a Big Data Processing Library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, and data engineers alike to access a variety of Big Data and traditional data stores with just SQL or a single line of code (the Unified Data API).
This is possible via a catalog of technical properties that is abstracted away from users, along with a rich collection of data store connectors available in the Gimel library.
A Catalog provider can be Hive or User Supplied (runtime) or UDC.
In addition, PayPal recently open sourced UDC [Unified Data Catalog], which can host and serve the technical metadata of data stores and objects. Visit http://www.unifieddatacatalog.io to experience it first hand.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
SQL Analytics Powering Telemetry Analysis at Comcast (Databricks)
Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK providing the backbone of telemetry to the industry. RDK (Reference Design Kit) is pre-bundled opensource firmware for a complete home platform covering video, broadband and IoT devices. RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video and broadband and IoT devices) installed in customer homes. They run ETL and aggregation pipelines and publish analytical dashboards on a daily basis to reduce customer calls and firmware rollout. The analysis is also used to calculate WIFI happiness index which is a critical KPI for Comcast customer experience.
In addition to this, RDK team also does release tracking by analyzing the RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the “Test and Learn” with SQL Analytics and the Delta engine that we ran in partnership with the Databricks team: a quick demo introducing the SQL-native interface, the challenges we faced with migration, the results of the execution, and our journey of productionizing this at scale.
Delivering Insights from 20M+ Smart Homes with 500M+ Devices (Databricks)
We started out processing big data using AWS S3, EMR clusters, and Athena to serve Analytics data extracts to Tableau BI.
However, as our data and team sizes increased, Avro schemas from source data evolved, and we attempted to serve analytics data through web apps, we hit a number of limitations in the AWS EMR and Glue/Athena approach.
This is a story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, from numerous internal business teams and our 150+ CSP partners.
We will describe lessons learnt and best practices established as we enabled our teams with DataBricks autoscaling Job clusters and Notebooks and migrated our Avro/Parquet data to use MetaStore, SQL Endpoints and SQLA Console, while charting the path to the Delta lake…
First in Class: Optimizing the Data Lake for Tighter Integration (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both reservoir for historical data and as a "system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
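To make the HBase storage layer slightly more concrete, here is an illustrative sketch using the happybase client to upsert and read back a per-customer profile row; the table name, column family, and row-key scheme are invented, not taken from the case study.

```python
import happybase  # pip install happybase; talks to HBase via the Thrift gateway

connection = happybase.Connection("localhost")  # placeholder Thrift host
table = connection.table("customer_profiles")   # hypothetical table name

# Row key: customer id. A real design might salt or reverse keys to avoid
# region hotspotting, one of the data-modeling concerns mentioned above.
table.put(b"cust-42", {
    b"profile:last_event_ts": b"2016-03-29T10:15:30Z",
    b"profile:txn_count_24h": b"17",
    b"profile:avg_amount_24h": b"83.20",
})

# Point lookup of the profile, as a low-latency scoring path might do.
row = table.row(b"cust-42")
print(row[b"profile:txn_count_24h"])
```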
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co... (DataStax)
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics) data in real time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord, Vice President – Analytics, Business Intelligence, Data Management, Element Corp.
The document summarizes key considerations for managing successful data projects, including understanding the problem, selecting appropriate software, managing risk, building effective teams, and architecting maintainable solutions. It covers major data project types like data pipelines, processing, and applications. It also discusses evaluating and selecting data management solutions by considering factors like solution lifecycles, tipping points, demand, fit, visibility, and risks. The overall goal is to provide foundations for architecting successful data solutions.
Foundations for Successful Data Projects – Strata London 2019 (Jonathan Seidman)
The document discusses foundations for successful data projects. It covers understanding the key data project types including data pipelines, data processing and analysis, and application development. It discusses considerations and risks for each type as well as ideal team makeup. The document also covers evaluating and selecting data solutions, discussing solution lifecycles and tipping point considerations like mavericks, connectors, and salespeople who can help drive adoption.
Presented at IDEAS SoCal on Oct 20, 2018. I discuss the main approaches to deploying data science engines to production and provide sample code for the comprehensive approach of real-time scoring with MLeap and Spark ML.
This document provides an overview of big data and the Spark framework. It discusses the big data ecosystem, including file systems, data ingestion tools, batch and real-time data processing frameworks, visualization tools, and support technologies. It outlines common big data job roles and their associated skills. The document then focuses on Spark, describing its core functionality, modules like DataFrames and MLlib, and execution modes. It provides guidance on learning Spark, emphasizing programming skills and Spark APIs. A demo of Spark fundamentals on a big data lab is also proposed.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can click the link below to go through the video afterwards.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Similar to Architecting a Next Gen Data Platform – Strata London 2018 (20)
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012 (Jonathan Seidman)
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Extending the Data Warehouse with Hadoop - Hadoop World 2011 (Jonathan Seidman)
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Distributed Data Analysis with Hadoop and R - Strangeloop 2011 (Jonathan Seidman)
This document describes a talk on interfacing Hadoop and R for distributed data analysis. It introduces Hadoop and R, discusses options for running R on Hadoop's distributed platform including the authors' prototypes, and provides an example use case of analyzing airline on-time performance data using Hadoop Streaming and R code. The authors are data engineers from Orbitz who have built prototypes for user segmentation and analyzing airline and hotel booking data on Hadoop using R.
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011 (Jonathan Seidman)
The document discusses how Orbitz Worldwide integrated Hadoop into its enterprise data infrastructure to handle large volumes of web analytics and transactional data. Some key points:
- Orbitz used Hadoop to store and analyze large amounts of web log and behavioral data to improve services like hotel search. This allowed analyzing more data than their previous 2-week data archive.
- They faced initial resistance but built a Hadoop cluster with 200TB of storage to enable machine learning and analytics applications.
- The challenges now are providing analytics tools for non-technical users and further integrating Hadoop with their existing data warehouse.
Distributed Data Analysis with Hadoop and R - OSCON 2011 (Jonathan Seidman)
This document summarizes a presentation on interfacing Hadoop and R for distributed data analysis. It introduces Hadoop and R, describes options for running R on Hadoop including Hadoop Streaming and Hadoop Interactive (Hive), and provides an example use case of analyzing airline on-time performance data. Key points include interfacing Hadoop and R at the cluster level to bring parallel processing capabilities to R, and using tools like Hadoop Streaming and RHIPE to allow R code to be run on Hadoop clusters.
Extending the EDW with Hadoop - Chicago Data Summit 2011 (Jonathan Seidman)
This document summarizes a presentation given by Robert Lancaster and Jonathan Seidman about how their company, Orbitz, is extending their enterprise data warehouse with Hadoop. They discuss how Hadoop provides scalable storage and processing of large amounts of log and web analytics data. They then provide examples of how this data is used for applications like optimizing hotel search, recommendations, and user segmentation. Finally, they outline their vision of integrating Hadoop and the data warehouse to provide a unified view for business intelligence and analytics tools.
Orbitz used Hadoop and Hive to address the challenge of processing and analyzing large amounts of log and user data. They were able to improve their hotel sorting and ranking by using machine learning algorithms on data stored in Hadoop. Statistical analysis of the Hadoop data provided insights into user behaviors and helped optimize aspects of the user experience like hotel search and recommendations. Orbitz found Hadoop to be a cost-effective solution that has expanded to more uses across the company.
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010 (Jonathan Seidman)
Using Hadoop and Hive, Orbitz analyzed large amounts of web analytics data to optimize travel search and gain insights. They loaded over 500GB of daily log data into Hadoop and used Hive to run SQL-like queries to derive metrics like the position of booked hotels in search results and booking position trends by location. Statistical analysis in R helped explore trends, correlations and outliers in the Hive datasets to help machine learning applications.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI Model Garden-powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill models to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps that follow current generative AI industry trends.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Learn SQL from basic queries to advanced queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
5. About the presenters
Ted Malaska
▪ Director of Engineering of Global Insights at Blizzard
▪ Cloudera Principal Solution Architect
▪ Architect at FINRA
▪ Contributor to: Apache Spark, Hadoop, Hive, Sqoop, YARN, Flume, etc.
21. Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
- Reliable and scalable storage to support modeling and processing of multiple data formats.
25. High level architecture
[Architecture diagram: Data Sources → Streaming Pipes (Producers such as agents and log aggregators, Transport, Replication) → Stream Processing (Schema Validation, Enrichment, Routing) → Storage (File/Object, RDBMS/MPP, Time Series, Reverse Indexed, Memory, Stream) → Access (SQL, Machine Learning, Request Response, Batch Processing, Code).]
26. Key to Customer 360 Success
Your project is only as good as the quality and variety of data sources.
[Diagram of example sources: streaming vehicle data (MQTT), geo-location/traffic data, customer data, maintenance data, and other data sources such as files (CSV? XML? JSON?), Twitter?, mainframe?, and databases (Salesforce?).]
27. High Level Architecture
[Same architecture diagram as slide 25.]
33. REST Proxy
Talking to non-native Kafka apps and outside the firewall
[Diagram: non-Java applications talk to the REST Proxy over REST/HTTP; native Kafka Java applications talk to the cluster directly.]
▪ Provides a RESTful interface to a Kafka cluster
▪ Simplifies message creation and consumption
▪ Simplifies administrative actions
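As a rough illustration of the idea, here is a minimal Scala sketch that posts one JSON-encoded record to a topic through a REST Proxy using only the JDK HTTP client. The proxy URL and topic name are made up, and the content type shown is the one used by the Confluent REST Proxy v2 API; adjust both for your environment.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object RestProxyProducerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical REST Proxy endpoint and topic name.
    val url = new URL("http://restproxy.example.com:8082/topics/taxi-trip-input")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    // Content type used by the REST Proxy v2 API for JSON-encoded records.
    conn.setRequestProperty("Content-Type", "application/vnd.kafka.json.v2+json")
    conn.setDoOutput(true)

    val payload = """{"records":[{"value":{"vin":"ABC123","speed":42}}]}"""
    val writer = new OutputStreamWriter(conn.getOutputStream)
    writer.write(payload)
    writer.close()

    // A successful response includes per-record offsets and partitions.
    println(s"HTTP ${conn.getResponseCode}")
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}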
34. Kafka Connect
Streaming data capture
[Diagram: the Kafka Connect API moves data between Kafka and external systems (JDBC, logs, MQTT, RDBMS, key/value stores, HDFS) acting as sources and sinks.]
▪ Part of the Apache Kafka project
▪ Fault tolerant
▪ Manages hundreds of data sources and sinks
▪ Preserves data schema
▪ Includes simple transformations
41. Goals for our Transport Layer
▪ To meet these goals we want some kind of publish-subscribe queue:
- Kafka
- Kinesis
- RabbitMQ
- Azure Queues
- Azure Service Bus
- Google Pub/Sub
- etc…
43. Buffering Data
▪ What do we mean by “buffering” and why do we need it?
[Diagram: a producer emits a burst of events (event, event, event, …) while the consumer takes them one at a time; without a buffer in between, this is bad!]
▪ Network partitions happen
▪ Producers and consumers work at different rates
▪ Reliable storage is hard; stream processing is hard. Let’s do one at a time.
45. What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or a “streaming data platform”
[Diagram: a data source appends messages to an ordered log (offsets 0–8); data consumers A and B each read from their own position in the log.]
46. Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
[Diagram: a data source writes to a topic with three partitions (0, 1, 2), each an append-only log of offsets 0–8.]
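To make the partition and ordering point concrete, here is a minimal sketch using the standard Kafka Java producer from Scala. The broker address and topic name are assumptions; the message key is what determines the partition, so records sharing a key keep their relative order.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KeyedProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Records sharing a key hash to the same partition, so ordering is
    // preserved per key but not across the whole topic.
    producer.send(new ProducerRecord("taxi-trip-input", "vehicle-42", """{"speed":31}"""))
    producer.send(new ProducerRecord("taxi-trip-input", "vehicle-42", """{"speed":35}"""))
    producer.close()
  }
}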
50. Kafka Considerations – Reliability
▪ Different reliability levels for different topics:
- Taxi trip data → taxi-trip-input topic: 100%, duplicates are OK (“at least once”)
- Twitter → customer-sentiment topic: <= 100% (“at most once”)
▪ News flash: Kafka’s exactly-once producer is on the way
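A hedged sketch of how these two reliability levels might map onto producer settings with the standard Kafka producer; the topic assignments follow the slide above, and the exact configuration names and guarantees depend on your Kafka version.

import java.util.Properties

object ReliabilityConfigSketch {
  // "At least once" for taxi-trip-input: wait for all in-sync replicas and retry,
  // accepting that retries can produce duplicates downstream.
  val atLeastOnce = new Properties()
  atLeastOnce.put("acks", "all")
  atLeastOnce.put("retries", "10")

  // "At most once" for customer-sentiment: fire and forget, no retries,
  // so transient failures simply drop messages.
  val atMostOnce = new Properties()
  atMostOnce.put("acks", "0")
  atMostOnce.put("retries", "0")
}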
63. How many partitions?
▪ Adding partitions late in the game is painful
▪ Basic formula: total desired throughput / throughput of slowest consumer or producer
▪ Or size partitions at roughly 25 GB of disk space each
▪ Not too many, because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown/recovery
- 1,000–4,000 partitions per broker max
- Producers will produce smaller batches, lowering throughput
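A tiny arithmetic sketch of the basic formula above, with made-up throughput numbers:

object PartitionCountSketch {
  // Hypothetical numbers: target 300 MB/s into the topic, and the slowest
  // consumer can only keep up with 25 MB/s per partition.
  val targetThroughputMBps = 300.0
  val slowestConsumerMBps  = 25.0

  // Basic formula from the slide: total desired throughput divided by the
  // throughput of the slowest producer or consumer.
  val partitions = math.ceil(targetThroughputMBps / slowestConsumerMBps).toInt

  def main(args: Array[String]): Unit =
    println(s"Start with roughly $partitions partitions")   // 12 here
}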
65. Guarding Against Message Loss
▪ Producer – What happens if the producer loses connection to Kafka and the buffer overflows?
- You get an exception. You can choose to… block? Write to file?
▪ Source – What happens if events are lost before getting sent to producer?
- Once again use some kind of buffer to provide sufficient retention of data.
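A minimal sketch of the producer-side knobs involved when the local buffer fills, using the standard Kafka producer from Scala; the broker address, topic, and chosen values are assumptions. Whether you block, drop, or spill the event to a file is the application's choice.

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

object BufferingProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("buffer.memory", "33554432")  // 32 MB of local buffer for unsent records
    props.put("max.block.ms", "5000")       // block send() up to 5s when the buffer is full

    val producer = new KafkaProducer[String, String](props)
    try {
      producer.send(
        new ProducerRecord("taxi-trip-input", "vehicle-42", """{"speed":31}"""),
        new Callback {
          // Broker-side failures after the record was buffered surface here.
          override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
            if (exception != null) System.err.println(s"Send failed: ${exception.getMessage}")
        })
    } catch {
      // Thrown when the local buffer stays full past max.block.ms: decide here
      // whether to block longer, drop, or spill the event to a local file.
      case e: Exception => System.err.println(s"Could not buffer event: ${e.getMessage}")
    }
    producer.close()
  }
}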
70. What do we mean by streaming?
[Spectrum: Real-time, constant low milliseconds and under; Near real-time, low milliseconds to seconds with delay in case of failures; Batch, tens of seconds or more with re-runs in case of failures.]
71. What do we mean by streaming?
[Same spectrum as slide 70.]
72. But, there’s no free lunch
[The same spectrum, annotated: the real-time and near real-time end means “difficult” architectures with lower latency; the batch end means “easier” architectures with higher latency.]
92. What to do with Sessions?
- A session is a window of events defined by a gap
- What to do with them:
- Length of sessions
- Number of sessions within a day
- Average separation between events in a session
- Common orders within sessions
- Counts of types of events within sessions
[Diagram: a timeline of events split by gaps into Session 1, Session 2, Session 3.]
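A small pure-Scala sketch of the gap-based definition above: split a time-ordered list of event timestamps into sessions wherever the gap exceeds a threshold (the threshold and events are made up). Session length, sessions per day, and similar metrics then become simple aggregations over the result.

object SessionizeSketch {
  // Split a time-ordered list of event timestamps (ms) into sessions whenever
  // the gap between consecutive events exceeds maxGapMs.
  def sessionize(timestamps: Seq[Long], maxGapMs: Long): Seq[Seq[Long]] =
    timestamps.foldLeft(Vector.empty[Vector[Long]]) { (sessions, ts) =>
      sessions.lastOption match {
        case Some(current) if ts - current.last <= maxGapMs =>
          sessions.init :+ (current :+ ts)   // extend the current session
        case _ =>
          sessions :+ Vector(ts)             // gap too large: start a new session
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(0L, 10000L, 20000L, 400000L, 410000L, 900000L)
    val sessions = sessionize(events, maxGapMs = 5 * 60 * 1000)
    println(s"Number of sessions: ${sessions.size}")                        // 3
    println(s"Session lengths (ms): ${sessions.map(s => s.last - s.head)}") // 20000, 10000, 0
  }
}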
99. Ingestion
- File Systems & Object Store
- Normally you want larger files
- Normally you want high compression
- Normally you want deduping
- Window of deduping
- 100% deduping in this case is difficult
- Think about sequence numbering from source
101. Ingestion
- NoSQL Time Series
- Dumb inserts
- You may need aggregation across metrics
- To increase throughput you may want to buffer writes
- Context: a row is a key plus times and data points. Writing
  Row1: Time1, DataPoint1
  Row1: Time2, DataPoint2
  is slower than writing
  Row1: Time1, DataPoint1, Time2, DataPoint2
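A hedged sketch of that "wide row" idea using the standard HBase client from Scala: several data points are packed into one Put on one row instead of one row per (time, value) pair. The table name, row-key layout, and column family are assumptions.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object TimeSeriesPutSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("vehicle_metrics"))   // assumed table name

    // One row keyed by entity plus a coarse time bucket, with each reading in
    // its own column: one RPC carrying several data points, rather than one
    // row per (time, value) pair.
    val put = new Put(Bytes.toBytes("vehicle-42#2018-05-24-12"))
    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("12:00:01"), Bytes.toBytes(31.0))
    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("12:00:02"), Bytes.toBytes(35.0))
    table.put(put)

    table.close()
    conn.close()
  }
}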
103. Aggregation & Counting
- This is where we talk about Lambda
- There are many definitions
- Common but not correct: jobs that involve both batch & streaming
- Correct: a perfect count is not possible with streaming alone, so we use a combination of streaming and batch to show the right value
104. Aggregation & Counting
- Why is streaming not perfect?
[Diagram: a speed layer incrementing a counter in NoSQL. Get event to process → increment (+2, value goes from 10 to 12) → acknowledge event. After a failure the same event is fetched again → increment (+2, value goes from 12 to 14) → acknowledge. The stored value ends at 14 even though the true count is 12.]
109. Aggregation & Counting
- Failure problem solved by adding internal state
[Diagram: a micro-batch layer keeping its own state. Get batch Y → count batch Y → update state by key → for each value, put the absolute result to NoSQL → acknowledge batch. On failure the state is reset to the start of the batch and the batch is replayed, so the same result (Put(14)) is written again rather than incremented twice.]
111. Aggregation & Counting
- Deduping with sequence numbers
Incoming events (Source, Sequence, Value): A 1 10 | B 1 100 | A 2 10 | B 2 100 | A 3 10 | B 2 100 (replayed duplicate) | B 3 100
Running state (Seq of A, Value of A, Seq of B, Value of B):
1, 10, -, -
1, 10, 1, 100
2, 20, 1, 100
2, 20, 2, 200
3, 30, 2, 200
3, 30, 2, 200 (duplicate of B sequence 2 is ignored)
3, 30, 3, 300
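A small pure-Scala sketch of the same idea: keep the last sequence number applied per source and skip anything already seen, so a replayed event does not double-count. The event values mirror the table above.

object SeqNumDedupSketch {
  case class Event(source: String, seq: Long, value: Long)

  // Running totals per source, ignoring any event whose sequence number has
  // already been applied for that source (i.e. a replayed duplicate).
  def aggregate(events: Seq[Event]): Map[String, Long] =
    events.foldLeft(Map.empty[String, (Long, Long)]) { (state, e) =>
      val (lastSeq, total) = state.getOrElse(e.source, (0L, 0L))
      if (e.seq > lastSeq) state.updated(e.source, (e.seq, total + e.value))
      else state                               // duplicate: skip
    }.map { case (src, (_, total)) => src -> total }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("A", 1, 10), Event("B", 1, 100),
      Event("A", 2, 10), Event("B", 2, 100),
      Event("A", 3, 10), Event("B", 2, 100),   // replayed duplicate of (B, 2)
      Event("B", 3, 100))
    println(aggregate(events))                 // Map(A -> 30, B -> 300)
  }
}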
121. Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
122. Semantics of our architecture
[Diagram: source systems 1–3 → ingest (at least once) → message broker (at least once, ordered, partitioned) → extract → streaming engine (it depends) → push → destination system (it depends).]
130. Spark Streaming – Gaps
▪ Latency is not as low
- Efforts towards reducing latency, e.g. RISELab’s Drizzle
▪ Globally consistent execution state requires:
- Stopping overall execution of the distributed computation
- Eagerly persisting records in transit, meaning larger snapshots
131. Flink
▪ True “streaming” system, but not as feature-rich as Spark
▪ Much better event-time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Cleaner APIs for triggering, evicting, and state management
▪ Lower latency
- No micro-batching
- Asynchronous Barrier Snapshotting (ABS)
139. Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or in memory hash map)
▪ Handles late events
▪ Stream-to-stream joins
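As a rough illustration, a minimal Kafka Streams topology in Scala (using the Java API with anonymous classes): read one topic, filter, and route the surviving records to another topic, with Kafka itself as the messaging layer. The application id, broker address, and topic names are assumptions, and the config keys shown are from recent Kafka Streams releases.

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.Predicate

object StreamsFilterSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "taxi-filter")       // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")   // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    // Read events from one topic, keep only those carrying a speed reading,
    // and route them to another topic.
    builder.stream[String, String]("taxi-trip-input")
      .filter(new Predicate[String, String] {
        override def test(key: String, value: String): Boolean = value.contains("\"speed\"")
      })
      .to("taxi-trip-valid")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
  }
}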
146. Basics of GFS => HDFS
- NameNode
- Metadata for all the files/blocks
- Which DataNodes they are assigned to
- Replication management
- DataNodes
- Metadata for each block location on disk
147. Basics of GFS => HDFS
[Diagram: Client, NameNode, and DataNodes A, B, C.]
Write path:
A. Ask the NameNode for a location to write
B. Write to a DataNode following the NameNode’s instructions
C. The DataNode handles replication
D. Confirm to the client that the file is persisted
148. Basics of GFS => HDFS
- Files are immutable
- Files can be of any type
- Files are broken up into blocks (128 MB -> 1 GB)
- Metadata cost is per block, not per data size
- A file may or may not be splittable when reading
150. Object Store (Like and not like HDFS)
- Like HDFS
- Contains files
- Breaks up large files
- Not like HDFS
- Not really a file system; more key/value, like a NoSQL store
- Doesn’t have the metadata limit problem
- Traversing folder directories is more work
- There is no rename, only copy and delete
- Eventual consistency issues with listing files
- (seen with things like MR and Spark)
- Can be mostly addressed with EMRFS
152. Object Store (Thinking Remote)
- Unlike HDFS, the storage is always remote
- Not on the same nodes as the execution
- Which allows you to save money in the cloud
- Execution nodes are expensive vs. storage-only nodes
- The network will be used to read and write
- In fact you are normally throttled well before the network limit of your node
- You will want the highest rates of compression possible
- To save money on storage
- To read and write faster
156. Compression Codecs
- Snappy: 2x-3x : Fast Read, Fast Write
- Lzo: 2x-3x : Fast Read, Fast Write
- Gzip: ~8x: ~Fast Read, Normal Write
- Default: ~8x: ~Fast Read, Normal Write
- BZip2: ~10x ~Fast Read, Slow Write
- Others ..
- Always be skeptical
- All data compresses differently
- Use your own data
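One way to "use your own data" is simply to write the same sample with different codecs and compare on-disk size and job time. A hedged Spark sketch (paths are made up):

import org.apache.spark.sql.SparkSession

object CompressionComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compression-test").getOrCreate()

    // Assumed input path; substitute a representative sample of your own data,
    // since every dataset compresses differently.
    val df = spark.read.parquet("/data/taxi/trips_sample")

    // Write the same data with two codecs and compare size and write time.
    df.write.option("compression", "snappy").parquet("/tmp/trips_snappy")
    df.write.option("compression", "gzip").parquet("/tmp/trips_gzip")

    spark.stop()
  }
}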
157. Introducing the Hive Metastore
- Hive Metastore
- Adds a table-like metadata layer over a file system, block store, NoSQL store, or other storage
- Allows for SQL access
- Allows for greater security options
- Allows for external metadata
- Allows for partitioning
161. Thinking about Object/Tables
1. Let’s start off easy
1. Use Case: We are a Netflix-type company and we have a log of users and movies watched that looks something like this:
User ID | Age | Account Start Date | Category of User | Movie Watched | Movie Category | Start Time | Events List
Bob | 42 | 12/12/2012 | Basic | Die Hard | Action | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
Kat | 31 | 12/12/2012 | Platinum | Beauty and the Beast | Family | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
162. Thinking about Object/Tables
1. To make this into objects we need to do some separation
[Entity model: User (User_id, Age, St_dt, Category); Movie (Movie_id, Title, Category); Watch_session (Watch_id, St_dt, En_dt, User_id, Movie_id); Watch_Events (Watch_id, St_dt, Type, Duration); Category_Typ (Category_id, Stream_rt, Is_feature_enabled). One-to-many relationships link the entities, e.g. a user or movie has many watch sessions and a watch session has many watch events.]
163. Query Considerations
- Data is normally big, so:
- Partition according to access patterns
- Join with care
- Consider sampling or local testing before experimenting
- Data is files
- Latency to accessibility is high: seconds, minutes, or more
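A hedged Spark sketch of partitioning to match access patterns: write the data partitioned by the column most queries filter on, so engines only read the directories they need. The paths and column names are assumptions.

import org.apache.spark.sql.SparkSession

object PartitionedLayoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-layout").getOrCreate()
    val trips = spark.read.parquet("/data/taxi/trips_raw")   // assumed input path

    // Partition on the column most queries filter by (here a date column),
    // so only the matching directories are read at query time.
    trips.write
      .partitionBy("trip_date")
      .parquet("/data/taxi/trips_by_date")

    // A query filtering on trip_date now touches only the matching partitions.
    spark.read.parquet("/data/taxi/trips_by_date")
      .where("trip_date = '2018-05-01'")
      .count()

    spark.stop()
  }
}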
164. Look for big tables
[Same entity model as slide 162, highlighting the large, fast-growing tables.]
167. View Strategies
Models: Hive Relational Model, Hive Nested Model
Views:
- Hive Materialized Table: use in cases where the view requires a join that is done through a shuffle
- Hive Normal Views: use only for tables that filter records/columns, or for marking fields
169. Nested
▪ Less space than denormalization
▪ Still have tables, but the cost of joins is all but gone
▪ Also great for Cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL
170. Nested Example
CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING)
  STORED AS PARQUET;

CREATE TABLE dim_phones
(
    contact_id BIGINT
  , category STRING
  , international_code STRING
  , area_code STRING
  , exchange STRING
  , extension STRING
  , mobile BOOLEAN
  , carrier STRING
  , current BOOLEAN
  , service_start_date TIMESTAMP
  , service_end_date TIMESTAMP
)
STORED AS PARQUET;
172. De-normalized vs Nested
- Nested pros
- Co-location
- Faster to group by
- Faster to window
- Joins are free
- Less data
- Better compression
- Tables and columns can be read without penalty from ones not read
- Great for limiting the effort of Cartesian joins
- Nested cons
- Size limitation of the parent row
- Adding a child requires rewriting the whole parent record
192. Hash Map
- There is a key and a value
- It is really fast to grab a key/value
- It is really fast to add a key/value
- Iteration is also possible
[Diagram: a client reading a key → value table (A → 1, B → 1, … G → 1).]
193. Log with Compactions
- When new records come in they don’t rewrite the old
- They compact in
Existing log (Key, Time, Value): A 1 101 | B 1 101 | C 1 101 | D 1 101 | E 1 101 | F 1 101 | G 1 101
New records: A 2 102 | D 2 102 | F 2 102 | F 3 103 | H 3 103
After compaction: A 2 102 | B 1 101 | C 1 101 | D 2 102 | E 1 101 | F 3 103 | G 1 101 | H 3 103
194. Log with Compactions
- Write path
- Get the location for the record (cached)
- First to the WAL
- Then to the memstore
- Sorting & batching
- Flush to a new HFile
- Later, HFiles will be compacted
[Diagram: Client → RegionServer (WAL, memstore) → HFiles / new HFiles on HDFS, with the Master coordinating.]
195. Ordered
- All records and columns are ordered
- Ordering allows for simpler indexing
- Ordering allows for simpler compactions
- We will also use this ordering for:
- Windowing
- Time series
- Local scanning
[Diagram: Client → RegionServer (memstore, HFiles, new HFiles) on HDFS, with the Master coordinating.]
197. So what about SQL?
- Well, SQL could totally work
- CQL for Cassandra
- Hive and Spark SQL on HBase & Cassandra
- Why is it not the best idea?
- Built more for point lookups
- Scans are not as fast as Parquet
- However, the mutability may be more important than speed
- Partitioning is not simple
- It must be put into the key
199. HBase Model
[Diagram: Client, Master, Region Server 1, Region Server 2.]
- A Region Server owns range splits
- Region Server 1 fails
- The Master needs to figure that out
- The Master needs to assign a new Region Server to own the splits
- Region Server 2 has to get organized
- Region Server 2 is ready to serve reads and writes
207. What do they share (CAP theorem)
[Diagram: the CAP triangle of Consistency, Availability, and Partition tolerance, with systems placed between strong consistency and eventual consistency; having all three at once doesn’t exist.]
- Cheating the CAP theorem
- Cassandra is a good model
- Where they expand the definition of failure with variable consistency
- CAP still holds, but …
210. Lucene Indexing (Facets)
- Facets are a side effect of our wonderful indexes
- They allow us to count all the documents that belong to given indexes to produce:
- Grouped counts
- Charts and graphs (Kibana or Banana)
- People will also call this access pattern “cubing” a dataset
212. Lucene Indexing (Facets Example)
- Time series example
Document ID | Hour of Day | User | State | Event
1 | 12 | 4201 | MD | click
2 | 12 | 4202 | VA | click
3 | 12 | 4203 | VA | click
4 | 1 | 4201 | MD | click
5 | 1 | 4202 | VA | view
6 | 2 | 4204 | CA | click
7 | 2 | 4205 | VA | view
8 | 2 | 4201 | MD | click
213. Lucene Indexing (Facets Example)
Document table (Document ID | Hour of Day | User | State | Event): the same rows as slide 212, plus 9 | 2 | 4204 | CA | click.
Inverted indexes over the documents:
Hour of Day: 12 → 1, 2, 3; 1 → 4, 5; 2 → 6, 7, 8, 9
User: 4201 → 1, 4, 8; 4202 → 2, 5; 4203 → 3; 4204 → 6, 9; 4205 → 7
State: MD → 1, 4, 8; VA → 2, 3, 5, 7; CA → 6, 9
Event: click → 1, 2, 3, 4, 6, 8, 9; view → 5, 7
216. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the ordered posting list for Hour of Day = 2 (documents 6, 7, 8, 9) intersected with the posting lists for each State (MD: 1, 4, 8; VA: 2, 3, 5, 7; CA: 6, 9) to produce per-state counts.]
217. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the same Hour of Day = 2 and State posting lists as slide 216, with a new document incrementing the CA bucket (+1 CA).]
218. Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the same posting lists repeated, with new documents incrementing the VA, MD, and CA buckets (+1 VA, +1 MD, +1 CA).]
220. Writing Latency
- Lucene indexing is more expensive than NoSQL work
- Think of it as micro-batching
- Larger batches ~= better throughput
- Compaction is also involved
- Deletes impact storage and performance until they are compacted
225. BSP: Bulk Synchronous Parallel
- Process every node atomically
- A node gets all messages sent to it
- Nodes can mutate themselves and their edges
- Nodes can send messages to other nodes
- But nothing is received yet
- BSP waits until all the node processing is done
- Then messages are sent to the right partition
- Repeat
228. Kudu
1. Replaces the region servers with tablet servers
2. Replaces the block-format HFile files with Parquet-like TFiles
3. Replaces the byte-array-focused HBase API with one that is more JDBC friendly
4. Tight integration with Spark SQL and Impala for SQL
5. Completely rewrites the compaction process to make for perfectly sized files, without major compactions but with always-manageable micro compactions
233. Druid.IO
[Architecture diagram: a client queries a broker cluster; broker nodes handle query planning and response preparation and try to optimize the read path based on the time range of the query. A real-time cluster of real-time nodes takes the main streaming ingestion path and keeps a short-TTL hot in-memory cache, while a history cluster of history nodes serves data loaded through async large-batch ingestion from pluggable storage. ZooKeeper and the metadata storage provide the housekeeping services.]
237. Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ For example,
- Kudu -> HDFS Nested is batch processing
- KMeans calculation, etc.
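A hedged sketch of that kind of periodic batch job with Spark ML: train a KMeans model over historical data, the "larger context" a streaming job can't easily hold. The input path and feature columns are assumptions.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object BatchKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trip-kmeans").getOrCreate()

    // Assumed historical table with numeric trip features.
    val trips = spark.read.parquet("/data/taxi/trips_by_date")

    val features = new VectorAssembler()
      .setInputCols(Array("trip_distance", "trip_duration", "fare_amount"))  // assumed columns
      .setOutputCol("features")
      .transform(trips)

    // Cluster all historical trips; the resulting model can then be used
    // elsewhere, e.g. for scoring new events.
    val model = new KMeans().setK(5).setFeaturesCol("features").fit(features)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}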
249. REST Servers
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
import com.sun.jersey.spi.container.servlet.ServletContainer
…
// Embedded Jetty server exposing Jersey (JAX-RS) REST resources.
val server = new Server(port)
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
  "com.sun.jersey.api.core.PackagesResourceConfig")
// Package scanned for annotated resource classes serving the HBase-backed endpoints.
sh.setInitParameter("com.sun.jersey.config.property.packages",
  "com.hadooparchitecturebook.taxi360.server.hbase")
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true")
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*")
server.start()
server.join()
255. SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large scale aggregation
▪ Optionally integrates with multiple storage systems for real-time updates to SQL tables