Lessons learned while taking Presto from alpha to production at Twitter. Presented at the Presto meetup at Facebook on 2015.03.22.
Video: https://www.facebook.com/prestodb/videos/531276353732033/
(CMP310) Data Processing Pipelines Using Containers & Spot Instances - Amazon Web Services
It's difficult to find off-the-shelf, open-source solutions for creating lean, simple, and language-agnostic data-processing pipelines for machine learning (ML). This session shows you how to use Amazon S3, Docker, Amazon EC2, Auto Scaling, and a number of open source libraries as cornerstones to build one. We also share our experience creating elastically scalable and robust ML infrastructure leveraging the Spot instance market.
A True Story About Database Orchestration - InfluxData
During this talk, Gianluca will share the architecture of the project, describe the critical aspects of the infrastructure, and explain how the team strives to make this powerful service secure, fast, and reliable for all customers using InfluxCloud.
Data-Driven Development Era and Its Technologies - Satoshi Tagomori
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
The document summarizes a workshop agenda for new InfluxData practitioners. It outlines the schedule of presentations and topics to be covered throughout the day-long workshop, including installing and querying the TICK stack, Chronograf dashboarding, writing queries, architecting InfluxEnterprise, optimizing the TICK stack, and downsampling data. The final presentation on downsampling data is given by Michael DeSa and covers the concepts of downsampling, why it is useful, and how to perform it in InfluxDB using continuous queries and Kapacitor.
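The downsampling concept this workshop covers can be sketched outside InfluxDB too. A minimal Python illustration (the 10-minute interval and sample values are my own assumptions, not from the workshop) of averaging raw points into fixed time buckets, which is essentially what a continuous query computes on a schedule:

```python
from collections import defaultdict

def downsample(points, interval_s=600):
    """Average (timestamp, value) points into fixed-width time buckets.

    This mimics what an InfluxDB continuous query with GROUP BY time(10m)
    produces on a schedule: one mean value per bucket.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - (ts % interval_s)].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

raw = [(0, 10.0), (60, 20.0), (300, 30.0), (700, 40.0)]
print(downsample(raw))  # two 10-minute buckets: {0: 20.0, 600: 40.0}
```

The same idea extends to other aggregates (min, max, count); the point of downsampling is that queries over long time ranges hit the small rollup series instead of the raw data.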
Presentation given at Coolblue B.V. demonstrating Apache Airflow (incubating), what we learned from the underlying design principles, and how an implementation of these principles reduces the amount of ETL effort. Why choose Airflow? Because it makes your engineering life easier and lets more people contribute to how data flows through the organization, so you can spend more time applying your brain to more difficult problems like machine learning, deep learning, and higher-level analysis.
Netflix running Presto in the AWS Cloud - Zhenxiao Luo
Netflix runs Presto in its AWS cloud environment to enable low-latency ad-hoc queries on petabyte-scale data stored in S3. Some key things Netflix did include optimizing Presto to read from and write directly to S3, fixing bugs, integrating Presto with its EMR and Ganglia monitoring, and deploying a 100+ node Presto cluster that handles over 1000 queries per day. Performance testing showed Presto was often 10x faster than Hive for various queries and joins. Netflix continues optimizing Presto for its needs like supporting Parquet, ODBC/JDBC drivers, and looking to address current limitations.
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup] - Kevin Xu
This presentation was delivered at the NYC SQL meetup on September 27, 2018. It provided a technical overview of the TiDB platform, a deep dive into TiDB's MySQL-compatible layer and MySQL ecosystem tools, a use case from Mobike, and an appendix with detailed materials on the coprocessor and transaction model.
Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teo...
Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don’t have the money, you are losing all the fun.
In my talk I show you how you can use Google BigQuery to manage big data from your application using a hosted solution. And you can start with less than $1 per month.
SYNCING IN JAVASCRIPT: MULTI-CLIENT COLLABORATION THROUGH DATA SHARING (Steve... - Future Insights
Presentation taken from Future of Web Apps Boston (http://futureofwebapps.com/boston-2014)
In this talk, Steve will build a system from scratch for cross-device data synchronization in JavaScript. Through demos, he will explore all the things you're probably not thinking about when rolling your own sync engine, like offline caching, change notification, and conflict resolution. Drawing on his experience from Dropbox, Steve will discuss the thorny challenges around sync and how to solve them.
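Of the sync problems named above, conflict resolution is the easiest to sketch in a few lines. The last-writer-wins merge below is only one possible policy and is my own illustration, not the approach from Steve's talk:

```python
def merge(local, remote):
    """Merge two replicas of key -> (timestamp, value); last writer wins.

    Real sync engines need much more than this: tombstones for deletes,
    vector clocks or server-assigned ordering instead of wall-clock
    timestamps, and a way to surface true conflicts to the user.
    """
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

a = {"title": (1, "Draft"), "body": (5, "hello")}
b = {"title": (3, "Final"), "tags": (2, "notes")}
print(merge(a, b))  # {'title': (3, 'Final'), 'body': (5, 'hello'), 'tags': (2, 'notes')}
```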
WordPress RESTful API & Amazon API Gateway - WordCamp Kansai 2016 - 崇之 清水
This document summarizes a presentation given at WordCamp Kansai 2016 about building REST APIs and microservices with Amazon API Gateway and WordPress. The presentation covered:
1. Using REST APIs with WordPress
2. Integrating WordPress with Amazon API Gateway
3. Examples of building WordPress APIs to access third party services and custom backends
The presentation provided examples of using API Gateway as a proxy for the WordPress REST API, enabling CORS, and building microservices architectures with API Gateway, Lambda, and other AWS services behind the WordPress frontend. Attendees were encouraged to explore building scalable WordPress sites and applications with REST APIs and serverless architectures on AWS.
1Spatial: Cardiff FME World Tour: Live vessel tracking - FME Cloud - 1Spatial
FME Cloud was used to build a solution, within a week, that ingests live shipping data from an API into an ArcGIS Online map for a client. An FME workspace was created to retrieve vessel positions from the MarineTraffic API every 2 minutes and write them to an ArcGIS Online feature service using a truncate-and-update pattern. It also archived positions to a Google Fusion Table. FME Cloud monitoring and notifications ensured the solution ran continuously and that issues were detected. The solution met all requirements, including being live by Friday, without requiring on-premise hardware or ongoing costs beyond the initial subscription period.
Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin - Till Rohrmann
This talk shows how we can use Apache Flink and Apache Zeppelin to do interactive data analysis. The examples show the usage of FlinkML to solve a linear regression and classification problem.
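As a rough idea of what the linear-regression example in this talk computes, here is ordinary least squares in plain Python; this sketch is mine (FlinkML solves the equivalent problem with distributed optimization over Flink data sets, and the sample points are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: the model a simple
    linear regression (as in the FlinkML example) is fitting."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x  # (a, b)

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points lie exactly on y = 2x + 1
print(a, b)  # 2.0 1.0
```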
The document discusses evolving schemas in NoSQL databases. It describes starting with a simple data structure and search index, then enhancing it to support dynamic filtering and cached previews without hitting the main data store. It also covers approaches for migrating data to a new format, such as adding new fields, while the system is live using techniques like versioning the data and writing upgrade functions. Finally, it recommends some lessons learned, such as that schemaless does not mean no schema, changes should be painless, and agile code needs agile data.
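The versioning-plus-upgrade-function approach described above can be made concrete with a small sketch; the field names and version numbers here are invented for illustration, not taken from the document:

```python
# Each stored document carries a schema version; an upgrade function
# migrates one version forward, and they are chained on read so old
# and new documents can coexist while the system stays live.
def v1_to_v2(doc):          # add a new field with a default
    return {**doc, "_v": 2, "tags": []}

def v2_to_v3(doc):          # rename a field
    doc = dict(doc)
    doc["name"] = doc.pop("title")
    doc["_v"] = 3
    return doc

UPGRADES = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def upgrade(doc):
    while doc.get("_v", 1) < CURRENT_VERSION:
        doc = UPGRADES[doc.get("_v", 1)](doc)
    return doc

print(upgrade({"_v": 1, "title": "first post"}))
# {'_v': 3, 'tags': [], 'name': 'first post'}
```

Writing each upgrade as a pure function that moves exactly one version forward is what makes live migration "painless": new code reads any old document and the chain brings it up to date.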
Docker for Mac & local developer environment optimization - Radek Baczynski
Docker can be used to optimize a local development environment by providing the same environment as production. Issues with performance on Docker for Mac can be addressed through techniques like using delegated volume mounts, removing xdebug, and using a solution like mutagen that syncs files without mounted volumes for faster performance. Mutagen provides near native performance, easy setup and monitoring, and works with any dockerized application.
Kapacitor is the brains of the TICK Stack. Nathaniel will cover the stream processing capabilities of Kapacitor, how to process data before it gets stored in InfluxDB and after it is stored, best practices around anomaly detection and machine learning. In addition, Nathaniel will discuss how to configure the clustered version of Kapacitor.
Technologies, Data Analytics Service and Enterprise Business - Satoshi Tagomori
This document discusses technologies for data analytics services for enterprise businesses. It begins by defining enterprise businesses as those "not about IT" and data analytics services as providing insights into business metrics like customer reach, ad views, purchases, and more using data. It then outlines some key technologies needed for such services, including data management systems, distributed processing systems, queues and schedulers, tools for connecting systems, and methods for controlling jobs and workflows with retries to handle failures. Specific challenges around deadlines, idempotent operations, and replay-able workflows are also addressed.
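The retry-and-idempotency point above can be sketched in a few lines of Python; the helper and the flaky job below are my own illustration, not code from the talk:

```python
import time

def run_with_retries(job, attempts=3, backoff_s=0.0):
    """Run a job, retrying on failure.

    As the talk notes, retries are only safe when the job is idempotent:
    running it twice must leave the same result as running it once
    (e.g. overwrite an output partition rather than appending to it).
    """
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the scheduler
            time.sleep(backoff_s)

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"

print(run_with_retries(flaky))  # "done", after two failed attempts
```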
Wisely Chen Spark Talk at Spark Gathering in Taiwan - Wisely Chen
- The document discusses SparkSQL and Parquet as part of Appier's data pipeline. Appier uses SparkSQL with Parquet on HDFS to enable SQL queries on large datasets and support machine learning applications.
- Parquet was chosen because it has good performance, supports nested data structures, and is the preferred file format for SparkSQL. Storing data in Parquet files on HDFS provides a low-cost and scalable solution that gives Appier full control over their data.
- SparkSQL allows any Spark or SQL code to be reused across ETL, machine learning, and SQL querying applications. This makes development and maintenance more efficient for Appier's data team.
This document discusses Deezer's use of Elasticsearch for search, recommendations, and analysis of music metadata.
It provides an overview of Deezer's Elasticsearch architecture, which includes indexing over 50 million tracks from Hadoop and replicating indexes across clusters. It also discusses how Deezer queries Elasticsearch using custom analyzers, multi search APIs, and function score queries for recommendations. Finally, it describes Deezer's use of the ELK stack to analyze over 2 billion logs and metrics documents through Kibana dashboards.
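The function score queries mentioned above layer extra ranking signals on top of text relevance. A sketch of what such an Elasticsearch request body can look like, built as a plain Python dict (the field names, signal, and boost mode are invented for illustration, not Deezer's actual ranking):

```python
import json

# A function_score query boosts matching documents by additional signals
# (here, a popularity counter) on top of the text-match relevance score.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"track_title": "daft punk"}},
            "functions": [
                {"field_value_factor": {"field": "play_count",
                                        "modifier": "log1p"}},
            ],
            "boost_mode": "sum",  # add the popularity factor to the text score
        }
    }
}
print(json.dumps(query, indent=2))
```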
This document discusses data collection and ingestion tools. It begins with an overview of data collection versus ingestion, with collection happening at the source and ingestion receiving the data. Examples of data collection tools include rsyslog, Scribe, Flume, Logstash, Heka, and Fluentd. Examples of ingestion tools include RabbitMQ, Kafka, and Fluentd. The document concludes with a case study of asynchronous application logging and challenges to consider.
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin... - New Relic
The document discusses Project Rubicon, a software analytics tool from New Relic. It summarizes Rubicon's ability to capture raw event data from applications, allowing users to ask complex questions. It then demonstrates how to write NRQL queries to analyze metrics like page views and custom events over time. NRQL makes it easy to aggregate large amounts of data through functions, time windows, time series, and facets. The document also provides an overview of Rubicon's architecture and how it handles billions of events through techniques like using memory efficiently and building for failure.
Muga Nishizawa discusses Embulk, an open-source bulk data loader. Embulk loads records from various sources to various targets in parallel using plugins. Treasure Data customers use Embulk to upload different file formats and data sources to their TD database. While Embulk is focused on bulk loading, TD also develops additional tools to generate Embulk configurations, manage loads over time, and scale Embulk using a MapReduce executor on Hadoop clusters for very large data loads.
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
tado° Makes Your Home Environment Smart with InfluxDB - InfluxData
Michal Knizek, Head of Research and Development at tado° GmbH, will share how they use InfluxData to gather data collected from their Smart Thermostat to help turn any home thermostat into a smart device. This device uses a variety of information collected (geo-location, temperature, user settings, current device functional state) to serve information to automatically control the environment temperature as well as letting users know when the device may need maintenance.
ApacheCon 2021 - Apache NiFi Deep Dive 300 - Timothy Spann
21-September-2021 - ApacheCon - Tuesday 17:10 UTC - Apache NiFi Deep Dive 300
* https://github.com/tspannhw/EverythingApacheNiFi
* https://github.com/tspannhw/FLiP-ApacheCon2021
* https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
* https://github.com/tspannhw/FLiP-IoT
* https://github.com/tspannhw/FLiP-Energy
* https://github.com/tspannhw/FLiP-SOLR
* https://github.com/tspannhw/FLiP-EdgeAI
* https://github.com/tspannhw/FLiP-CloudQueries
* https://github.com/tspannhw/FLiP-Jetson
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
For Data Engineers who have flows already in production, I will dive deep into best practices, advanced use cases, performance optimizations, tips, tricks, edge cases, and interesting examples. This is a master class for those looking to quickly learn things I have picked up after years in the field running Apache NiFi in production.
This will be interactive and I encourage questions and discussions.
You will take away examples and tips in slides, github, and articles.
This talk will cover:
Load Balancing
Parameters and Parameter Contexts
Stateless vs Stateful NiFi
Reporting Tasks
NiFi CLI
NiFi REST Interface
DevOps
Advanced Record Processing
Schemas
RetryFlowFile
Lookup Services
RecordPath
Expression Language
Advanced Error Handling Techniques
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
The document discusses the challenges of processing and storing billions of data inserts per day from vehicle telematics projects. Some key points:
- The project involves receiving continuous data streams from over 500 vehicles with 2500 data points captured per vehicle per second, resulting in over 1.5 billion MySQL inserts daily.
- A message queue is used to receive the streaming data and buffer inserts to help scale processing. Additional optimizations include bulk loading data via LOAD DATA INFILE for speed.
- Sharding and splitting the data across multiple databases by vehicle and time period (weekly tables) helps improve query performance for both live and historical data access.
- Techniques like asynchronous requests, caching, and a single entry point
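The sharding scheme in the bullets above can be made concrete with a small routing helper; the database count and table-naming convention below are my own guesses at such a scheme, not the project's actual layout:

```python
from datetime import date

def shard_table(vehicle_id, day, n_databases=4):
    """Route a telematics reading to a (database, table) pair.

    Splitting by vehicle spreads write load across databases, while
    weekly tables keep each table small enough that both live and
    historical queries stay fast.
    """
    db = f"telematics_{vehicle_id % n_databases}"
    year, week, _ = day.isocalendar()
    return db, f"readings_{year}w{week:02d}"

print(shard_table(517, date(2024, 3, 8)))  # ('telematics_1', 'readings_2024w10')
```

Queries for one vehicle over a known time range then touch a single database and only the handful of weekly tables that overlap the range.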
Netflix running Presto in the AWS CloudZhenxiao Luo
Netflix runs Presto in its AWS cloud environment to enable low-latency ad-hoc queries on petabyte-scale data stored in S3. Some key things Netflix did include optimizing Presto to read from and write directly to S3, fixing bugs, integrating Presto with its EMR and Ganglia monitoring, and deploying a 100+ node Presto cluster that handles over 1000 queries per day. Performance testing showed Presto was often 10x faster than Hive for various queries and joins. Netflix continues optimizing Presto for its needs like supporting Parquet, ODBC/JDBC drivers, and looking to address current limitations.
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]Kevin Xu
This presentation was delivered at the NYC SQL meetup on September 27, 2018. It provided a technical overview of the TiDB Platform, a deep dive into TiDB's MySQL compatible layer and MySQL ecosystem tools, use case of Mobike, and appendix with detail materials on coprocessor and transaction model.
Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teo...javier ramirez
Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don’t have the money, you are losing all the fun.
In my talk I show you how you can use Google BigQuery to manage big data from your application using a hosted solution. And you can start with less than $1 per month.
SYNCING IN JAVASCRIPT: MULTI-CLIENT COLLABORATION THROUGH DATA SHARING (Steve...Future Insights
Presentation taken from Future of Web Apps Boston (http://futureofwebapps.com/boston-2014)
In this talk, Steve will build a system from scratch for cross-device data synchronization in JavaScript. Through demos, he will explore all the things you're probably not thinking about when rolling your own sync engine, like offline caching, change notification, and conflict resolution. Drawing on his experience from Dropbox, Steve will discuss the thorny challenges around sync and how to solve them.
WordPress RESTful API & Amazon API Gateway - WordCamp Kansai 2016崇之 清水
This document summarizes a presentation given at WordCamp Kansai 2016 about building REST APIs and microservices with Amazon API Gateway and WordPress. The presentation covered:
1. Using REST APIs with WordPress
2. Integrating WordPress with Amazon API Gateway
3. Examples of building WordPress APIs to access third party services and custom backends
The presentation provided examples of using API Gateway as a proxy for the WordPress REST API, enabling CORS, and building microservices architectures with API Gateway, Lambda, and other AWS services behind the WordPress frontend. Attendees were encouraged to explore building scalable WordPress sites and applications with REST APIs and serverless architectures on AWS.
1Spatial: Cardiff FME World Tour: Live vessel tracking - FME Cloud1Spatial
FME Cloud was used to build a solution to ingest live shipping data from an API into an ArcGIS Online map for a client within a week. An FME workspace was created to retrieve vessel positions from the MarineTraffic API every 2 minutes and write them to truncate and update an ArcGIS Online feature service. It also archived positions to a Google Fusion Table. FME Cloud monitoring and notifications ensured the solution ran continuously and issues were detected. The solution met all requirements including being live by Friday without requiring on-premise hardware or ongoing costs beyond the initial subscription period.
Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinTill Rohrmann
This talk shows how we can use Apache Flink and Apache Zeppelin to do interactive data analysis. The examples show the usage of FlinkML to solve a linear regression and classification problem.
The document discusses evolving schemas in NoSQL databases. It describes starting with a simple data structure and search index, then enhancing it to support dynamic filtering and cached previews without hitting the main data store. It also covers approaches for migrating data to a new format, such as adding new fields, while the system is live using techniques like versioning the data and writing upgrade functions. Finally, it recommends some lessons learned, such as that schemaless does not mean no schema, changes should be painless, and agile code needs agile data.
Docker for mac & local developer environment optimizationRadek Baczynski
Docker can be used to optimize a local development environment by providing the same environment as production. Issues with performance on Docker for Mac can be addressed through techniques like using delegated volume mounts, removing xdebug, and using a solution like mutagen that syncs files without mounted volumes for faster performance. Mutagen provides near native performance, easy setup and monitoring, and works with any dockerized application.
Kapacitor is the brains of the TICK Stack. Nathaniel will cover the stream processing capabilities of Kapacitor, how to process data before it gets stored in InfluxDB and after it is stored, best practices around anomaly detection and machine learning. In addition, Nathaniel will discuss how to configure the clustered version of Kapacitor.
Technologies, Data Analytics Service and Enterprise BusinessSATOSHI TAGOMORI
This document discusses technologies for data analytics services for enterprise businesses. It begins by defining enterprise businesses as those "not about IT" and data analytics services as providing insights into business metrics like customer reach, ad views, purchases, and more using data. It then outlines some key technologies needed for such services, including data management systems, distributed processing systems, queues and schedulers, tools for connecting systems, and methods for controlling jobs and workflows with retries to handle failures. Specific challenges around deadlines, idempotent operations, and replay-able workflows are also addressed.
Wisely Chen Spark Talk At Spark Gathering in Taiwan Wisely chen
- The document discusses SparkSQL and Parquet as part of Appier's data pipeline. Appier uses SparkSQL with Parquet on HDFS to enable SQL queries on large datasets and support machine learning applications.
- Parquet was chosen because it has good performance, supports nested data structures, and is the preferred file format for SparkSQL. Storing data in Parquet files on HDFS provides a low-cost and scalable solution that gives Appier full control over their data.
- SparkSQL allows any Spark or SQL code to be reused across ETL, machine learning, and SQL querying applications. This makes development and maintenance more efficient for Appier's data team.
This document discusses Deezer's use of Elasticsearch for search, recommendations, and analysis of music metadata.
It provides an overview of Deezer's Elasticsearch architecture, which includes indexing over 50 million tracks from Hadoop and replicating indexes across clusters. It also discusses how Deezer queries Elasticsearch using custom analyzers, multi search APIs, and function score queries for recommendations. Finally, it describes Deezer's use of the ELK stack to analyze over 2 billion logs and metrics documents through Kibana dashboards.
This document discusses data collection and ingestion tools. It begins with an overview of data collection versus ingestion, with collection happening at the source and ingestion receiving the data. Examples of data collection tools include rsyslog, Scribe, Flume, Logstash, Heka, and Fluentd. Examples of ingestion tools include RabbitMQ, Kafka, and Fluentd. The document concludes with a case study of asynchronous application logging and challenges to consider.
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...New Relic
The document discusses Project Rubicon, a software analytics tool from New Relic. It summarizes Rubicon's ability to capture raw event data from applications, allowing users to ask complex questions. It then demonstrates how to write NRQL queries to analyze metrics like page views and custom events over time. NRQL makes it easy to aggregate large amounts of data through functions, time windows, time series, and facets. The document also provides an overview of Rubicon's architecture and how it handles billions of events through techniques like using memory efficiently and building for failure.
Muga Nishizawa discusses Embulk, an open-source bulk data loader. Embulk loads records from various sources to various targets in parallel using plugins. Treasure Data customers use Embulk to upload different file formats and data sources to their TD database. While Embulk is focused on bulk loading, TD also develops additional tools to generate Embulk configurations, manage loads over time, and scale Embulk using a MapReduce executor on Hadoop clusters for very large data loads.
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
tado° Makes Your Home Environment Smart with InfluxDBInfluxData
Michal Knizek, Head of Research and Development at tado° GmbH, will share how they use InfluxData to gather data collected from their Smart Thermostat to help turn any home thermostat into a smart device. This device uses a variety of information collected (geo-location, temperature, user settings, current device functional state) to serve information to automatically control the environment temperature as well as letting users know when the device may need maintenance.
ApacheCon 2021 - Apache NiFi Deep Dive 300Timothy Spann
21-September-2021 - ApacheCon - Tuesday 17:10 UTC Apache NIFi Deep Dive 300
* https://github.com/tspannhw/EverythingApacheNiFi
* https://github.com/tspannhw/FLiP-ApacheCon2021
* https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
* https://github.com/tspannhw/FLiP-IoT
* https://github.com/tspannhw/FLiP-Energy
* https://github.com/tspannhw/FLiP-SOLR
* https://github.com/tspannhw/FLiP-EdgeAI
* https://github.com/tspannhw/FLiP-CloudQueries
* https://github.com/tspannhw/FLiP-Jetson
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
Tuesday 17:10 UTC
Apache NIFi Deep Dive 300
Timothy Spann
For Data Engineers who have flows already in production, I will dive deep into best practices, advanced use cases, performance optimizations, tips, tricks, edge cases, and interesting examples. This is a master class for those looking to learn quickly things I have picked up after years in the field with Apache NiFi in production.
This will be interactive and I encourage questions and discussions.
You will take away examples and tips in slides, github, and articles.
This talk will cover:
Load Balancing
Parameters and Parameter Contexts
Stateless vs Stateful NiFi
Reporting Tasks
NiFi CLI
NiFi REST Interface
DevOps
Advanced Record Processing
Schemas
RetryFlowFile
Lookup Services
RecordPath
Expression Language
Advanced Error Handling Techniques
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
The document discusses the challenges of processing and storing billions of data inserts per day from vehicle telematics projects. Some key points:
- The project involves receiving continuous data streams from over 500 vehicles with 2500 data points captured per vehicle per second, resulting in over 1.5 billion MySQL inserts daily.
- A message queue is used to receive the streaming data and buffer inserts to help scale processing. Additional optimizations include bulk loading data via LOAD DATA INFILE for speed.
- Sharding and splitting the data across multiple databases by vehicle and time period (weekly tables) helps improve query performance for both live and historical data access.
- Techniques like asynchronous requests, caching, and a single entry point also help the system scale.
A general introduction to Spring Data / Neo4J (Florent Biville)
Spring Data Neo4j provides a framework for mapping graph data to Java objects and interacting with Neo4j from Spring applications. It allows defining entities as nodes and relationships and provides repositories with built-in CRUD operations. Queries can be written using Cypher or the template API. This reduces boilerplate code and provides a familiar Spring programming model for graph databases.
At the beginning of 2021, Shopify Data Platform decided to adopt Apache Flink to enable modern stateful stream-processing. Shopify had a lot of experience with other streaming technologies, but Flink was a great fit due to its state management primitives.
After about six months, Shopify now has a flourishing ecosystem of tools, tens of prototypes from many teams across the company and a few large use-cases in production.
Yaroslav will share a story about not just building a single data pipeline but building a sustainable ecosystem. You can learn about how they planned their platform roadmap, the tools and libraries Shopify built, the decision to fork Flink, and how Shopify partnered with other teams and drove the adoption of streaming at the company.
The document describes the Neo4j graph database and platform vision. It discusses key components like index-free adjacency, ACID transactions, clustering, and hardware optimizations. It outlines use cases for graph analytics, transactions, AI, and data integration. It also covers drivers, APIs, visualization, and administration tools. Finally, it previews upcoming innovations in Neo4j 3.4 like geospatial support, native string indexes, and rolling upgrades.
Case Study: VF Corporation Takes a Practical Approach to Improving its MOJO w... (CA Technologies)
VF Corporation uses CA Application Performance Management (APM) to monitor their ecommerce system called MOJO. With minimal customizations to APM, VF has been able to improve MOJO's performance, minimize downtime, and deliver useful performance data to application teams. Some key results include reduced average response times from 30 to under 15 seconds after a data center upgrade, preventing downtime from a growing directory issue, and correlating batch job runs to system impacts.
Data science for infrastructure dev week 2022 (ZainAsgar1)
The document discusses using data science and automation for infrastructure monitoring. It introduces Pixie, a tool that allows users to collect raw data, transform it into signals, and then take actions based on those signals. Two examples are provided: 1) detecting SQL injections from application logs and sending Slack alerts, and 2) automatically scaling a deployment based on HTTP request throughput metrics. Pixie uses an embedded domain-specific language called PxL to define logical data workflows and queries.
Data Labs supports LINE services by performing high-level data analysis and machine learning model development using their Hadoop data lake. The machine learning lifecycle involves many steps beyond just model training, including data collection, preprocessing, deployment, and monitoring. LINE's platform provides the necessary infrastructure to efficiently perform each step of the lifecycle, allowing for rapid continuous development and experimentation through tools like HDFS, Kubernetes, Jupyter notebooks, and CI/CD pipelines.
Near real-time anomaly detection at Lyft (markgrover)
Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
This document discusses various techniques for finding and exploiting vulnerabilities during a penetration test when vulnerabilities are marked as "low" or "medium" in severity. It argues that penetration testers and clients should not rely solely on vulnerability scanners and should thoroughly investigate even lower severity issues. Specific techniques mentioned include exploiting default credentials on services like VNC, exploiting exposed admin interfaces found through tools like Metasploit, taking advantage of browsable directories with backups or other sensitive files, exploiting SharePoint misconfigurations, exploiting HTTP PUT or WebDAV configurations, exploiting Apple Filing Protocol, and exploiting trace.axd to view request details in .NET applications. The document emphasizes finding overlooked vulnerabilities and keeping "a human in the mix" rather than relying fully on automated tools.
This document provides an overview of Apache Flink, an open-source stream processing framework. It discusses Flink's capabilities in supporting streaming, batch, and iterative processing natively through a streaming dataflow model. It also describes Flink's architecture including the client, job manager, task managers, and various execution setups like local, remote, YARN, and embedded. Finally, it compares Flink to other stream and batch processing systems in terms of their APIs, fault tolerance guarantees, and strengths.
The devops approach to monitoring, Open Source and Infrastructure as Code StyleJulien Pivotto
Monitoring is critical for every decent application that runs in production. Many widely used monitoring tools show their limits in the age of Infrastructure as Code and cloud computing. Let's investigate how monitoring can face the new challenges: scalability, reproducibility, and automation.
The document discusses improvements made to Apache Flink by Alibaba, called Blink. Blink provides a unified SQL layer for both batch and streaming processes. It supports features like UDF/UDTF/UDAGG, stream-stream joins, windowing, and retraction. Blink also improves Flink's runtime to be more reliable and production-quality when running on large YARN clusters. It has a new architecture using a JobMaster and TaskExecutors. Checkpointing and state management were optimized for incremental backups. Blink has been running in production supporting many of Alibaba's critical systems and processing massive amounts of data.
This document provides an overview of Neo4j's vision and roadmap. It discusses Neo4j's goal of being a modern, enterprise data platform that can power both operational and analytical workloads. Key aspects of Neo4j's strategy include building a fully cloud-native database designed for operational and analytical graph workloads, with autonomous clustering to provide unlimited horizontal scalability. The document also briefly reviews recent Neo4j releases and highlights some new features like graph pattern matching and change data capture.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... (Demi Ben-Ari)
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we'll mention all of the aspects you should take into consideration when monitoring a distributed system built on tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution using your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion... (Codemotion)
Once you start working with Big Data systems, you discover a whole bunch of problems you won't find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk we'll mention all of the aspects you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We'll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
This document discusses Python packaging tools like setuptools and pip. It notes that setuptools is the core API that most packaging tools use for building, packaging, metadata, and dependency management. Pip is an implementation of the setuptools programming interface and is useful for finding, installing, and managing dependencies. The document recommends using Gradle as a build orchestrator to resolve dependencies, run builds, tests, and publishing. It proposes ways to integrate Python packaging metadata with Gradle.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 (Sinan KOZAK)
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Data collection involved gathering information on three key road events (normal driving on a normal street, speed bumps, and circular yellow speed bumps) and three aggressive driving actions (sudden start, sudden stop, and sudden entry). The gathered data is processed and analyzed using a machine learning system designed for devices with limited power and memory. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms, and the model requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
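As a side note on the reported figures: accuracy, precision, and recall all derive from confusion-matrix counts. The counts in the sketch below are made up for illustration and are not the paper's data; only the formulas reflect how such metrics are computed.

```go
package main

import "fmt"

// metrics computes accuracy, precision, and recall from confusion-matrix
// counts (true positives, false positives, false negatives, true negatives).
func metrics(tp, fp, fn, tn float64) (acc, prec, rec float64) {
	acc = (tp + tn) / (tp + fp + fn + tn) // correct over total
	prec = tp / (tp + fp)                 // of flagged events, how many were real
	rec = tp / (tp + fn)                  // of real events, how many were caught
	return
}

func main() {
	// Illustrative counts only, not from the paper.
	acc, prec, rec := metrics(92, 8, 8, 92)
	fmt.Printf("acc=%.2f prec=%.2f rec=%.2f\n", acc, prec, rec)
}
```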
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
The CBC machine is a common diagnostic tool used by doctors to measure a patient's red blood cell count, white blood cell count and platelet count. The machine uses a small sample of the patient's blood, which is then placed into special tubes and analyzed. The results of the analysis are then displayed on a screen for the doctor to review. The CBC machine is an important tool for diagnosing various conditions, such as anemia, infection and leukemia. It can also help to monitor a patient's response to treatment.
CHINA'S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT (jpsjournal1)
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on that power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and conventional and nontraditional security are all explored and explained. Using Mackinder's Heartland theory, Spykman's Rimland theory, and Hegemonic Stability theory, the study examines China's role in Central Asia. It adheres to an empirical epistemological method, takes care to remain objective, and critically analyzes primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. According to this study, China is seeing significant success in commerce, pipeline politics, and gaining influence over other governments, a success attributable to the effective use of key instruments such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
Software Engineering and Project Management - Introduction, Modeling Concepts... (Prakhyath Rai)
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
2. About Me: Ryan Neal
- Head of Infrastructure at Netlify
- Simultaneously fixing and breaking everything
- Senior Dev at Yelp
- Internal tools and metrics team
- Handled about 400k metrics/sec
- 12-18k pageviews/sec
- FDE at Palantir
- Developed counter-terrorism software
- 4 billion records/day
@ry_boflavin @netlify
3. @ry_boflavin @netlify
A developer’s toolkit for deploying git-backed,
browser-driven sites to an intelligent CDN
- Global CDN
- CI cluster
- Redundant DNS
- Prerender cluster
- Mongo cluster
- Rails cluster
- 4 cloud providers
- 14 PoPs
4. Distributed systems are cool
[architecture diagram: a Global CDN of many CDN nodes fronting an API cluster, a CI cluster of buildbot workers, a Pre-Render cluster, and DB land]
13. Immediate Problem
- Make the logs searchable
- Easy to add more logs
Long Term Vision
- A generic system to let services push data out
- An easy way to access that data for new and fun uses
Tool Requirements
- Easy installation
- Good scaling factors
- Secure
Spec before building
@ry_boflavin @netlify
14. And so the story begins...
@ry_boflavin @netlify
RabbitMQ
- Existing infrastructure
- Didn’t need enterprise messaging features
- Data was only metrics, telemetry and logs
Kafka
- Didn’t want to run ZooKeeper
- Didn’t need rewind or buffering
15.–20. Creating the Data plane
@ry_boflavin @netlify
[diagram sequence, one addition per slide: a random service publishes logs into nats (15); a streamer subscribes to the stream (16); a second random service publishes as well (17); elastinats consumes from nats and writes into es (18); elastinats is scaled out to a pool of instances (19); taptap joins the pipeline alongside the streamer (20)]
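The fan-out these data-plane slides build up can be mimicked in-process. This is a toy stand-in, not the nats client API: `hub`, `subscribe`, and `publish` are invented names, but the shape matches the diagrams, where the streamer and the elastinats pool each see the full log stream.

```go
package main

import "fmt"

// hub is a tiny in-process stand-in for the nats server in the slides:
// every published message is delivered to all subscribers.
type hub struct{ subs []chan string }

// subscribe registers a new consumer and returns its channel.
func (h *hub) subscribe() chan string {
	ch := make(chan string, 16)
	h.subs = append(h.subs, ch)
	return ch
}

// publish fans the message out to every subscriber.
func (h *hub) publish(msg string) {
	for _, ch := range h.subs {
		ch <- msg
	}
}

func main() {
	h := &hub{}
	streamer := h.subscribe()   // live log tailing
	elastinats := h.subscribe() // indexing into es
	h.publish(`{"level":"info","msg":"deploy finished"}`)
	fmt.Println(<-streamer == <-elastinats) // true
}
```

The key property is that consumers are independent: adding a new use for the data (the "new and fun uses" from the long-term vision) is just another subscription, with no change to the publishers.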
21. Elastinats lessons
@ry_boflavin @netlify
func(m *nats.Msg) {
    stats.IncrementMessagesConsumed()
    // hand the work off so the NATS subscription callback returns right away
    go func() {
        payload := message.NewPayload(string(m.Data), m.Subject)
        // maybe it is json!
        _ = json.Unmarshal(m.Data, payload)
        c <- payload
    }()
}
func(m *nats.Msg) {
    stats.IncrementMessagesConsumed()
    payload := message.NewPayload(string(m.Data), m.Subject)
    // maybe it is json!
    _ = json.Unmarshal(m.Data, payload)
    c <- payload // this send blocks the subscriber callback when c is full
}
- Don’t block the consumer
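One possible refinement of the non-blocking version above: spawning a goroutine per message keeps the callback fast, but the number of in-flight goroutines is unbounded under load. A bounded alternative is a non-blocking send that drops (and counts) messages when the pipeline is saturated. `Msg` and `handler` here are illustrative stand-ins for the real `*nats.Msg` callback, not elastinats code.

```go
package main

import "fmt"

// Msg is a stand-in for *nats.Msg in this sketch.
type Msg struct {
	Subject string
	Data    []byte
}

// handler returns a callback that never blocks the consumer: it attempts
// a non-blocking send into a bounded channel and drops the message
// (incrementing a counter) when downstream cannot keep up.
func handler(c chan *Msg, dropped *int) func(*Msg) {
	return func(m *Msg) {
		select {
		case c <- m: // fast path: downstream keeps up
		default: // channel full: drop rather than block
			*dropped++
		}
	}
}

func main() {
	c := make(chan *Msg, 2) // bounded buffer
	dropped := 0
	h := handler(c, &dropped)
	for i := 0; i < 3; i++ {
		h(&Msg{Subject: "logs", Data: []byte("{}")})
	}
	fmt.Println(len(c), dropped) // 2 1
}
```

Dropping telemetry under pressure is often preferable to back-pressuring the message bus, since logs and metrics are lossy by nature.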
26. Future Work
@ry_boflavin @netlify
- Use a nats_metrics library to measure and push to nats
- Add more taps for log analysis
- Migrate legacy services to push based metrics and logs