Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Real-Time and Real-World streaming analytics. The talk will map Flink's capabilities to streaming analytics use cases.
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
This face to face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation (link), marks a new era of Big Data analytics and in particular Real-Time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiples verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect and Kafka Streams: What is and why?
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
Stateful Stream Processing at In-Memory SpeedJamie Grier
This presentation describes results from a real-world system where I used Apache Flink's stateful stream processing capabilities to eliminate the key-value store bottleneck and the burden of the Lambda Architecture while also improving accuracy and gaining huge improvements in hardware efficiency!
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
In diesem Vortrag wird es zunächst einen kurzen Überblick über den aktuellen Stand im Bereich der Streaming-Datenanalyse geben. Danach wird es mit einer kleinen Einführung in das Apache-Flink-System zur Echtzeit-Datenanalyse weitergehen, bevor wir tiefer in einige der interessanten Eigenschaften eintauchen werden, die Flink von den anderen Spielern in diesem Bereich unterscheidet. Dazu werden wir beispielhafte Anwendungsfälle betrachten, die entweder direkt von Nutzern stammen oder auf unserer Erfahrung mit Nutzern basieren. Spezielle Eigenschaften, die wir betrachten werden, sind beispielsweise die Unterstützung für die Zerlegung von Events in einzelnen Sessions basierend auf der Zeit, zu der ein Ereignis passierte (event-time), Bestimmung von Zeitpunkten zum jeweiligen Speichern des Zustands eines Streaming-Programms für spätere Neustarts, die effiziente Abwicklung bei sehr großen zustandsorientierten Streaming-Berechnungen und die Zugänglichkeit des Zustandes von außerhalb.
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Real-Time and Real-World streaming analytics. The talk will map Flink's capabilities to streaming analytics use cases.
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
This face to face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation (link), marks a new era of Big Data analytics and in particular Real-Time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiples verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect and Kafka Streams: What is and why?
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
Stateful Stream Processing at In-Memory SpeedJamie Grier
This presentation describes results from a real-world system where I used Apache Flink's stateful stream processing capabilities to eliminate the key-value store bottleneck and the burden of the Lambda Architecture while also improving accuracy and gaining huge improvements in hardware efficiency!
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
In diesem Vortrag wird es zunächst einen kurzen Überblick über den aktuellen Stand im Bereich der Streaming-Datenanalyse geben. Danach wird es mit einer kleinen Einführung in das Apache-Flink-System zur Echtzeit-Datenanalyse weitergehen, bevor wir tiefer in einige der interessanten Eigenschaften eintauchen werden, die Flink von den anderen Spielern in diesem Bereich unterscheidet. Dazu werden wir beispielhafte Anwendungsfälle betrachten, die entweder direkt von Nutzern stammen oder auf unserer Erfahrung mit Nutzern basieren. Spezielle Eigenschaften, die wir betrachten werden, sind beispielsweise die Unterstützung für die Zerlegung von Events in einzelnen Sessions basierend auf der Zeit, zu der ein Ereignis passierte (event-time), Bestimmung von Zeitpunkten zum jeweiligen Speichern des Zustands eines Streaming-Programms für spätere Neustarts, die effiziente Abwicklung bei sehr großen zustandsorientierten Streaming-Berechnungen und die Zugänglichkeit des Zustandes von außerhalb.
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
This presentation from the 13th Flink Meetup in Berlin contains the regular community update for January and a walkthrough of the most important upcoming features in 2016
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more details about:
What is Apache Flink, how it fits into the Big Data ecosystem and why it is the 4G (4th Generation) of Big Data Analytics frameworks?
How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment?
Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
In this hands-on Apache Flink presentation, you will learn in a step-by-step tutorial style about:
• How to setup and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ... )?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code in local, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?
Stream Processing: Choosing the Right Tool for the JobDatabricks
Due to the increasing interest in real-time processing, many stream processing frameworks were developed. However, no clear guidelines have been established for choosing a framework for a specific use case. In this talk, two different scenarios are taken and the audience is guided through the thought process and questions that one should ask oneself when choosing the right tool. The stream processing frameworks that will be discussed are Spark Streaming, Structured Streaming, Flink and Kafka Streams.
The main questions are:
How much data does it need to process? (throughput)
Does it need to be fast? (latency)
Who will build it? (supported languages, level of API, SQL capabilities, built-in windowing and joining functionalities, etc)
Is accurate ordering important? (event time vs. processing time)
Is there a batch component? (integration of batch API)
How do we want it to run? (deployment options: standalone, YARN, mesos, …)
How much state do we have? (state store options) – What if a message gets lost? (message delivery guarantees, checkpointing).
For each of these questions, we look at how each framework tackles this and what the main differences are. The content is based on the PhD research of Giselle van Dongen in benchmarking stream processing frameworks in several scenarios using latency, throughput and resource utilization.
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinTill Rohrmann
This talk shows how we can use Apache Flink and Apache Zeppelin to do interactive data analysis. The examples show the usage of FlinkML to solve a linear regression and classification problem.
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkCarolyn Duby
ODSC East 2017 - How to use Zeppelin and Spark to document your research.
Reproducible research documents not just the findings of a study but the exact code required to produce those findings. Reproducible research is a requirement for study authors to reliably repeat their analysis or accelerate new findings by applying the same techniques to new data. The increased transparency allows peers to quickly understand and compare the methods of the study to other studies and can lead to higher levels of trust, interest and eventually more citations of your work. Big data introduces some new challenges for reproducible research. As our data universe expands and the open data movement grows, more data is available than ever to analyze, and the possible combinations are infinite. Data cleaning and feature extraction often involve lengthy sequences of transformations. The space allotted for publications is not adequate to effectively describe all the details, so they can be reviewed and reproduced by others. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze, explore and train machine learning models on large data sets. Zeppelin web-based notebooks capture and share code and interactive visualizations with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. Learn how to combine Spark with other supported interpreters to codify your results from cleaning to exploration to feature extraction and machine learning. Discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
http://flink-forward.org/kb_sessions/robust-stream-processing-with-apache-flink/
In this hands on talk and demonstration I’ll give a very short introduction to stream processing and then dive into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. We’ll focus on correctness and robustness in stream processing. During this live demo we’ll be developing a realtime analytics application and modifying it on the fly based on the topics we’re working though. We’ll exercise Flink’s unique features, demonstrate fault-recovery, clearly explain and demonstrate why Event Time is such an important concept in robust stateful stream processing and talk about and demonstrate the features you need in a stream processor in production. Some of the topics covered will be: – Stateful Stream Processing – Event Time vs. Processing Time – Fault tolerance – State management in the face of faults – Savepoints – Data re-processing – Planned downtime and upgrades
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streamng SQL, Time Versioned Tables, cluster-library-duality, language portability, etc.
We will take a sneak peek into our crystal ball and present in what the Flink community is working on next.
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Have Machine Learning part of your strategy or passively watch your industry completely transformed!
- How to advance your strategy for hybrid integration between cloud and on-premise deployments?
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
This presentation from the 13th Flink Meetup in Berlin contains the regular community update for January and a walkthrough of the most important upcoming features in 2016
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more details about:
What is Apache Flink, how it fits into the Big Data ecosystem and why it is the 4G (4th Generation) of Big Data Analytics frameworks?
How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment?
Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
In this hands-on Apache Flink presentation, you will learn in a step-by-step tutorial style about:
• How to setup and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ... )?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code in local, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?
Stream Processing: Choosing the Right Tool for the JobDatabricks
Due to the increasing interest in real-time processing, many stream processing frameworks were developed. However, no clear guidelines have been established for choosing a framework for a specific use case. In this talk, two different scenarios are taken and the audience is guided through the thought process and questions that one should ask oneself when choosing the right tool. The stream processing frameworks that will be discussed are Spark Streaming, Structured Streaming, Flink and Kafka Streams.
The main questions are:
How much data does it need to process? (throughput)
Does it need to be fast? (latency)
Who will build it? (supported languages, level of API, SQL capabilities, built-in windowing and joining functionalities, etc)
Is accurate ordering important? (event time vs. processing time)
Is there a batch component? (integration of batch API)
How do we want it to run? (deployment options: standalone, YARN, mesos, …)
How much state do we have? (state store options) – What if a message gets lost? (message delivery guarantees, checkpointing).
For each of these questions, we look at how each framework tackles this and what the main differences are. The content is based on the PhD research of Giselle van Dongen in benchmarking stream processing frameworks in several scenarios using latency, throughput and resource utilization.
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinTill Rohrmann
This talk shows how we can use Apache Flink and Apache Zeppelin to do interactive data analysis. The examples show the usage of FlinkML to solve a linear regression and classification problem.
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkCarolyn Duby
ODSC East 2017 - How to use Zeppelin and Spark to document your research.
Reproducible research documents not just the findings of a study but the exact code required to produce those findings. Reproducible research is a requirement for study authors to reliably repeat their analysis or accelerate new findings by applying the same techniques to new data. The increased transparency allows peers to quickly understand and compare the methods of the study to other studies and can lead to higher levels of trust, interest and eventually more citations of your work. Big data introduces some new challenges for reproducible research. As our data universe expands and the open data movement grows, more data is available than ever to analyze, and the possible combinations are infinite. Data cleaning and feature extraction often involve lengthy sequences of transformations. The space allotted for publications is not adequate to effectively describe all the details, so they can be reviewed and reproduced by others. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze, explore and train machine learning models on large data sets. Zeppelin web-based notebooks capture and share code and interactive visualizations with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. Learn how to combine Spark with other supported interpreters to codify your results from cleaning to exploration to feature extraction and machine learning. Discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
http://flink-forward.org/kb_sessions/robust-stream-processing-with-apache-flink/
In this hands on talk and demonstration I’ll give a very short introduction to stream processing and then dive into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. We’ll focus on correctness and robustness in stream processing. During this live demo we’ll be developing a realtime analytics application and modifying it on the fly based on the topics we’re working though. We’ll exercise Flink’s unique features, demonstrate fault-recovery, clearly explain and demonstrate why Event Time is such an important concept in robust stateful stream processing and talk about and demonstrate the features you need in a stream processor in production. Some of the topics covered will be: – Stateful Stream Processing – Event Time vs. Processing Time – Fault tolerance – State management in the face of faults – Savepoints – Data re-processing – Planned downtime and upgrades
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streamng SQL, Time Versioned Tables, cluster-library-duality, language portability, etc.
We will take a sneak peek into our crystal ball and present in what the Flink community is working on next.
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Have Machine Learning part of your strategy or passively watch your industry completely transformed!
- How to advance your strategy for hybrid integration between cloud and on-premise deployments?
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings along with its unique vision and philosophy, it is considered the 4 G (4th Generation) of Big Data Analytics frameworks providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is Apache Flink stack and how it fits into the Big Data ecosystem?
2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment?
3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Step-by-Step Introduction to Apache Flink Slim Baltagi
This a talk that I gave at the 2nd Apache Flink meetup in Washington DC Area hosted and sponsored by Capital One on November 19, 2015. You will quickly learn in step-by-step way:
How to setup and configure your Apache Flink environment?
How to use Apache Flink tools?
3. How to run the examples in the Apache Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?
5. How to write your Apache Flink program in an IDE?
A Big Data Journey: Bringing Open Source to FinanceSlim Baltagi
Slim Baltagi & Rick Fath. Closing Keynote: Big Data Executive Summit. Chicago 11/28/2012.
PART I – Hadoop at CME: Our Practical Experience
1. What’s CME Group Inc.?
2. Big Data & CME Group: a natural fit!
3. Drivers for Hadoop adoption at CME Group
4. Key Big Data projects at CME Group
5. Key Learning’s
PART II - Bringing Hadoop to the Enterprise:
Challenges & Opportunities
PART II - Bringing Hadoop to the Enterprise
1. What is Hadoop, what it isn’t and what it can help you do?
2. What are the operational concerns and risks?
3. What organizational changes to expect?
4. What are the observed Hadoop trends?
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Big Data at CME Group: Challenges and Opportunities Slim Baltagi
Presentation given on September 18, 2012 at the 'Hadoop in Finance Day' conference held in Chicago and organized by Fountainhead Lab at Microsoft's offices.
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...Flink Forward
Many use cases in the telecommunication industry require producing counters, quality metrics, and alarms in a streaming fashion with very low latency. Most of this metrics are only valuable when they’re made available as soon as the associated events happened. In our company we are looking for a system able to produce this kind of real-time indicator, which must handle massive amounts of data (400,000 eps) with often peak loads (like New Year’s Eve) or out-of-order events like massive network disorder. Low latency and flexible window management with specific watermark emission are also a must-haves. Heterogeneous format, multiple flow correlation, and the possibility of late data arrival are other challenges. Flink being already widely used at Bouygues Telecom for real-time data integration, its features made it the evident candidate for the future System. In this talk, we'll present a real use case of streaming analytics using Flink, Kafka & HBase along with other legacy systems.
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is Apache Flink stack and how it fits into the Big Data ecosystem? How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where to learn more about Apache Flink?
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
I compare Apache Flink to Apache Spark, Apache Tez, and MapReduce in Apache Hadoop in terms of performance. I run experiments using two benchmarks, Terasort and Hashjoin.
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationTimothy Spann
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
@PaasDev www.datainmotion.dev github.com/tspannhw medium.com/@tspann
Principal Developer Advocate
Princeton Future of Data Meetup
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-EY, ex-HPE.
Apache NiFi x Apache Kafka x Apache Flink
There are a lot of factors involved in determining how you can find our way around and avoid delays, bad weather,dangers and expenses. In this talk I will focus on public transport in the largest transit system in the United States, the MTA,
which is focused around New York City. Utilizing public and semi-public data feeds, this can be extended to most city and metropolitan areas around the world. As a personal example, I live in New Jersey and this is an extremely useful use of open source and public
data.
Once I am notified that I need to travel to Manhattan, I need to start my data streams flowing. Most of the data sources are REST feeds that are ingested by Apache NiFi to transform, convert, enrich and finalize it for usage in streaming tables with Flink SQL, but also keep that same contract with Kafka consumers, Iceberg tables and other users of this data. I do not need to many user interfaces to interopt with the system as I want my final decision sent in a Slack message to me and then I’ll get moving. Along the way data will be visible in NiFi lineage, Kafka topic views, Flink SQL output, REST output and Iceberg tables.
Apache NiFi, Apache Kafka, Apache OpenNLP, Apache Tika, Apache Flink, Apache Avro, Apache Parquet, Apache Iceberg.
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
https://medium.com/@tspann/open-source-streaming-talks-in-progress-3e75af8848b0
https://medium.com/@tspann/watching-airport-traffic-in-real-time-32c522a6e386
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, V.P. of Apache Beam; Founder/CEO at Operiant
Aljoscha Krettek offers a very short introduction to stream processing before diving into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible, with a focus on correctness and robustness in stream processing.
All of this will be done in the context of a real-time analytics application that we’ll be modifying on the fly based on the topics we’re working though, as Aljoscha exercises Flink’s unique features, demonstrates fault recovery, clearly explains why event time is such an important concept in robust, stateful stream processing, and covers the features you need in a stream processor to do robust, stateful stream processing in production.
We’ll also use a real-time analytics dashboard to visualize the results we’re computing in real time, allowing us to easily see the effects of the code we’re developing as we go along.
Topics include:
* Apache Flink
* Stateful stream processing
* Event time versus processing time
* Fault tolerance
* State management in the face of faults
* Savepoints
* Data reprocessing
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data Ecosystems, Confluent Data Platform and Kafka in Architecture.
Similar to Apache Flink community Update for March 2016 - Slim Baltagi (20)
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Apache Flink community Update for March 2016 - Slim Baltagi
1. Apache Flink Community Update
February 2016
Slim Baltagi
@SlimBaltagi
sbaltagi@gmail.com
New York City (NYC) Apache Flink Meetup,
March 3rd, 2016
2. Flink 1.0
Apache Flink 1.0.0 release is being
finalized. It is being voted and will be
available sometimes this month: March
2016!
Apache Flink 0.10 to 1.0 Migration
Guide will contain the API breaking
changes and deprecated components.
2
3. Flink in Action
Apache Flink in Action is probably the
First book on Apache Flink!
It will be published by Manning. It is being
co-authored by Sameer Wadkar
(@wadkar_sameer ),Slim Baltagi
(@SlimBaltagi) and Suneel
Marthi(@suneelmarthi)
Please stay tuned for the MEAP: Manning
Early Access Program!
3
4. Reading List
Dataflow/Beam & Spark: A Programming Model
Comparison. February 3rd, 2016
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-
comparison
How Apache Flink enables new streaming
applications Part II: State and versioning by Ufuk Celebi
and Kostas Tzoumas. February 3rd, 2016
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
The Essential Guide to Streaming-first Processing with
Apache Flink, Fabian Hueske and Kostas
Tzoumashttps://www.mapr.com/blog/essential-guide-streaming-first-
processing-apache-flink by Fabian Hueske and Kostas Tzoumas.
Apache Flink vs Apache Spark - Reproducible
experiments on
cloud Slides: http://www.slideshare.net/shelan1/apache-flink-vs-
apache-spark-reproducible-experiments-on-cloud
4
6. Global Flink Meetup Community
New Meetups:
Europe
• Apache Flink en Madrid: Feb. 4th , 2016
• Apache Flink London Meetup: Feb. 10th , 2016
Asia
• Bengaluru Apache Flink Meetup: Feb. 15th , 2016
America
• Seattle Apache Flink Meetup: Feb. 29, 2016
New York City (NYC) Apache Flink Meetup is now the
world’s largest Apache Flink meetup! Sorry Berlin and Bay
Area!
6
7. Upcoming talks at Conferences – March 2016
Qcon London, March 7-9, 2016. Talk by Robert
Metzger, dataArtisans:
• Stream Processing with Apache Flink
https://qconlondon.com/presentation/stream-
processing-apache-flink
Strata + Hadoop World ( 1 talk) (March 31, 2016).
Stephan Ewen, dataArtisans:
• Apache Flink: Streaming done right
http://conferences.oreilly.com/strata/hadoop-big-
data-ca/public/schedule/detail/47190
7
8. Upcoming talks at Conferences – April 2016
Hadoop Summit Dublin, April 13-14, 2016. 2 talks:
http://hadoopsummit.org/dublin/agenda/
• Overview of Apache Flink; the 4G of Big Data
Analytics Frameworks, Slim Baltagi, Capital One
• Unified Stream & Batch Processing with Apache
Flink, Ufuk Celebi, data Artisans
Kafka Summit, April 26, 2016
• Advanced Streaming Analytics with Apache Flink and
Apache Kafka, Stephan Ewen, data
Artisanshttp://kafka-summit.org/sessions/advanced-
streaming-analytics-with-apache-flink-and-apache-
kafka/
8
9. Few Upcoming Meetups
Talk about Apache Flink 1.0 and its new features by
Stephan Ewen, CEO of data Artisans:
• Washington DC Area Apache Flink Meetup, April 7th,
2016. http://www.meetup.com/Washington-DC-Area-
Apache-Flink-Meetup/
• New York City (NYC) Apache Flink Meetup, April 5th
or April 6th. Date to be confirmed.
http://www.meetup.com/New-York-City-NYC-Apache-
Flink-Meetup/
9
10. Happenings in the Flink Community
This is a daily updated list for each
month as Flink articles and blogs are
published, announcements are made
about Apache Flink releases, meetups,
conferences, ...
• February 2016 http://sparkbigdata.com/102-spark-blog-
slim-baltagi/28-happenings-in-the-flink-community-
february-2016
• March 2016http://www.sparkbigdata.com/102-spark-blog-
slim-baltagi/29-happenings-in-the-flink-community-march-
2016
10