This is a talk that I gave at the 2nd Apache Flink meetup in the Washington DC area, hosted and sponsored by Capital One on November 19, 2015. You will quickly learn, in a step-by-step way:
1. How to set up and configure your Apache Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?
5. How to write your Apache Flink program in an IDE?
1. Step-By-Step Introduction to Apache Flink
[Setup, Configure, Run, Tools]
Slim Baltagi @SlimBaltagi
Washington DC Area Apache Flink Meetup
November 19th, 2015
2. 2
For an overview of Apache Flink, see slides at http://www.slideshare.net/sbaltagi
[Figure: the Apache Flink stack]
• APIs & Libraries: DataSet API (Java/Scala/Python) for batch processing; DataStream API (Java/Scala) for stream processing; libraries Gelly, Table, ML, SAMOA; Hadoop M/R, Google Dataflow (WiP), MRQL, Cascading; Zeppelin
• System (runtime): distributed streaming dataflow, with a batch optimizer and a stream builder
• Deploy: Local (single JVM, embedded, Docker); Cluster (standalone, YARN, Tez, Mesos (WIP)); Cloud (Google's GCE, Amazon's EC2, IBM Docker Cloud, ...)
• Storage: Files (Local, HDFS, S3, Tachyon); Databases (MongoDB, HBase, SQL, ...); Streams (Flume, Kafka, RabbitMQ, ...)
3. 3
Agenda
1. How to set up and configure your Apache Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?
5. How to write your Apache Flink program in an IDE?
4. 4
1. How to set up and configure your Apache Flink environment?
1.1 Local (on a single machine)
1.2 Flink in a VM image (on a single machine)
1.3 Flink on Docker
1.4 Standalone Cluster
1.5 Flink on a YARN Cluster
1.6 Flink on the Cloud
5. 5
1.1 Local (on a single machine)
Flink runs on Linux, OS X and Windows. In order to execute a program on a running Flink instance (and not from within your IDE), you need to install Flink on your machine.
The following steps will be detailed for both Unix-like (Linux, Mac OS X) and Windows environments:
1.1.1 Verify requirements
1.1.2 Download the Flink binary package
1.1.3 Unpack the downloaded archive
1.1.4 Configure
1.1.5 Start a local Flink instance
1.1.6 Validate Flink is running
1.1.7 Run a Flink example
1.1.8 Stop the local Flink instance
6. 6
1.1 Local (on a single machine)
1.1.1 Verify requirements
The machine that Flink will run on must have Java 1.7.x or higher installed. To check the installed Java version, issue the command: java -version. The out-of-the-box configuration will use your default Java installation.
Optional: If you want to manually override the Java runtime to use, set the JAVA_HOME environment variable in a Unix-like environment. To check if JAVA_HOME is set, issue the command: echo $JAVA_HOME.
If needed, follow the instructions for installing Java and setting JAVA_HOME on a Unix system here: https://docs.oracle.com/cd/E19182-01/820-7851/inst_set_jdk_korn_bash_t/index.html
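For example, on a Unix-like system these checks might look as follows (the JDK path below is only illustrative; adjust it to wherever Java is installed on your machine):
$ java -version # must report 1.7.x or higher
java version "1.7.0_79" # sample output; yours will differ
$ echo $JAVA_HOME # empty output means JAVA_HOME is not set
$ export JAVA_HOME=/usr/lib/jvm/java-7-oracle # example path
$ export PATH=$JAVA_HOME/bin:$PATH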
7. 7
1.1 Local (on a single machine)
In a Windows environment, check the correct installation of Java by issuing the following command: java -version.
The bin folder of your Java Runtime Environment must be included in Windows' %PATH% variable. If needed, follow this guide to add Java to the path variable: http://www.java.com/en/download/help/path.xml
If needed, follow the instructions for installing Java and setting JAVA_HOME on a Windows system here: https://docs.oracle.com/cd/E19182-01/820-7851/inst_set_jdk_windows_t/ind
8. 8
1.1 Local (on a single machine)
1.1.2 Download the Flink binary package
The latest stable release of Apache Flink can be downloaded from http://flink.apache.org/downloads.html
For example, in a Linux-like environment, run the following command:
wget https://www.apache.org/dist/flink/flink-0.10.0/flink-0.10.0-bin-hadoop1-scala_2.10.tgz
Which version to pick?
• You don't have to install Hadoop to use Flink.
• But if you plan to use Flink with data stored in Hadoop, pick the version matching your installed Hadoop version.
• If you don't want to do this, pick the Hadoop 1 version.
9. 9
1.1 Local (on a single machine)
1.1.3 Unpack the downloaded .tgz archive
Example:
$ cd ~/Downloads # Go to download directory
$ tar -xvzf flink-*.tgz # Unpack the downloaded archive
$ cd flink-0.10.0
$ ls -l
10. 10
1.1 Local (on a single machine)
1.1.4 Configure
The resulting folder contains a Flink setup that can be locally executed without any further configuration.
flink-conf.yaml under flink-0.10.0/conf contains the default configuration parameters that allow Flink to run out-of-the-box in single-node setups.
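As an illustration, the single-node defaults in conf/flink-conf.yaml include entries along these lines (key names as documented for Flink 0.10; the exact values shipped in your download may differ):
$ grep -v '^#' conf/flink-conf.yaml # show the non-comment settings
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1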
11. 11
1.1 Local (on a single machine)
1.1.5 Start a local Flink instance:
• Given that you have a local Flink installation, you can start a Flink instance that runs a master and a worker process on your local machine in a single JVM.
• This execution mode is useful for local testing.
• On a UNIX-like system you can start a Flink instance as follows:
cd /to/your/flink/installation
./bin/start-local.sh
12. 12
1.1 Local (on a single machine)
1.1.5 Start a local Flink instance:
On Windows you can either start with:
• Windows batch files, by running the following commands:
cd C:\to\your\flink\installation
.\bin\start-local.bat
• or with Cygwin and Unix scripts: start the Cygwin terminal, navigate to your Flink directory and run the start-local.sh script:
$ cd /cygdrive/c
$ cd flink
$ bin/start-local.sh
13. 13
1.1 Local (on a single machine)
The JobManager (the master of the distributed system) automatically starts a web interface to observe program execution. It runs on port 8081 by default (configured in conf/flink-conf.yaml): http://localhost:8081/
1.1.6 Validate that Flink is running
You can validate that a local Flink instance is running by:
• Issuing the following command: $ jps
(jps: Java Virtual Machine Process Status Tool)
• Looking at the log files in ./log/
$ tail log/flink-*-jobmanager-*.log
• Opening the JobManager's web interface at
$ open http://localhost:8081
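Put together, a minimal validation session might look like this (the process id and log wording are illustrative; what matters is seeing a JobManager process alive and port 8081 answering):
$ jps
12345 JobManager # a JobManager process should be listed
$ tail -n 2 log/flink-*-jobmanager-*.log # the last lines should show a clean startup
$ curl -s http://localhost:8081/ > /dev/null && echo "web interface is up"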
14. 14
1.1 Local (on a single machine)
1.1.7 Run a Flink example
• On UNIX-Like system you can run a Flink example as follows:
cd /to/your/flink/installation
./bin/flink run ./examples/WordCount.jar
• With Windows batch files, open a second terminal and run the
following commands:
cd C:\to\your\flink\installation
.\bin\flink.bat run .\examples\WordCount.jar
1.1.8 Stop local Flink instance
• On UNIX, call ./bin/stop-local.sh
• On Windows, quit the running process with Ctrl+C
16. 16
1.2 VM image (on a single machine)
Please email me at sbaltagi@gmail.com for a
link from which you can download a Flink Virtual
Machine.
The Flink VM, which is approximately 4 GB, is from
data Artisans http://data-artisans.com/
It currently has Flink 0.10.0, Kafka, IDEs (IntelliJ,
Eclipse), Firefox, …
It will soon contain the FREE training from data
Artisans for Flink 0.10.0.
Meanwhile, an older version of this FREE training,
based on Flink 0.9.1, is available from
http://dataartisans.github.io/flink-training/
17. 17
1.3 Docker
Docker can be used for local development.
Container-based virtualization advantages:
• lightweight and portable
• build once, run anywhere
• easy packaging of applications
• automated and scripted
• isolated
Often resource requirements on data
processing clusters exhibit high variation.
Elastic deployments reduce TCO (Total Cost
of Ownership).
18. 18
1.3 Flink on Docker
Apache Flink cluster deployment on Docker using
Docker-Compose, by Simon Laws from IBM. Talk at
Flink Forward in Berlin on October 12, 2015.
Slides:
http://www.slideshare.net/FlinkForward/simon-laws-apache-flink-cluster-d
dockercompose
Video recording (40:49): https://www.youtube.com/watch?v=CaObaAv9tLE
The talk:
• Introduces the basic concepts of container isolation
exemplified on Docker
• Explains how Apache Flink is made elastic using
Docker-Compose.
• Shows how to push the cluster to the cloud,
exemplified on the IBM Docker Cloud.
19. 19
1.3 Flink on Docker
Apache Flink dockerized: This is a set of scripts to
create a local multi-node Flink cluster, each node
inside a docker container.
https://hub.docker.com/r/gustavonalle/flink/
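A purely illustrative invocation (the port mapping assumes the
image exposes the JobManager web UI on 8081; check the image's
page for its actual interface):
$ docker pull gustavonalle/flink
$ docker run -d -p 8081:8081 gustavonalle/flink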
Using Docker to set up a reproducible development
environment: Apache Flink cluster deployment
on Docker using Docker-Compose:
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
Web resources to learn more about Docker
http://www.flinkbigdata.com/component/tags/tag/47-dock
20. 20
1.4 Standalone Cluster
See quick start - Cluster
https://ci.apache.org/projects/flink/flink-docs-rele
setup
See instructions on how to run Flink in a fully
distributed fashion on a static (possibly
heterogeneous) cluster. This involves two
steps:
• Installing and configuring Flink
• Installing and configuring the Hadoop
Distributed File System (HDFS)
https://ci.apache.org/projects/flink/flink-docs-master/setup/cluster_setup.html
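As a sketch of the Flink side of the setup (hostnames are
illustrative), you point all nodes at the master in
conf/flink-conf.yaml, list the workers in conf/slaves, and
start the cluster from the master:
jobmanager.rpc.address: master-host   # in conf/flink-conf.yaml on all nodes
worker-host-1                         # in conf/slaves, one worker per line
worker-host-2
$ bin/start-cluster.sh                # run on the master node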
21. 21
1.5 Flink on a YARN Cluster
You can easily deploy Flink on your
existing YARN cluster:
1. Download the Flink Hadoop2 package: Flink with Hadoop 2
2. Make sure your HADOOP_HOME (or YARN_CONF_DIR or
HADOOP_CONF_DIR) environment variable is set to read your
YARN and HDFS configuration.
3. Run the YARN client with: ./bin/yarn-session.sh. You can run
the client with options -n 10 -tm 8192 to allocate 10
TaskManagers with 8 GB of memory each.
For more detailed instructions, please check out the
documentation:
https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html
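As a quick illustration (a sketch; the session flags are those
shown above), you can start a small session and then submit a
bundled example through the regular CLI, which picks up the
running YARN session:
$ ./bin/yarn-session.sh -n 4 -tm 4096        # 4 TaskManagers with 4 GB each
$ ./bin/flink run ./examples/WordCount.jar   # runs against the YARN session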
22. 22
1.6 Flink on the Cloud
1.6.1 Google Compute Engine (GCE)
1.6.2 Amazon EMR
23. 23
1.6 Cloud
1.6.1 Google Compute Engine
Free trial for Google Cloud Engine:
https://cloud.google.com/free-trial/
Enjoy your $300 in GCE for 60 days!
Now, how to setup Flink with Hadoop 1 or
Hadoop 2 on top of a Google Compute Engine
cluster?
Google’s bdutil starts a cluster and deploys
Flink with Hadoop. To get started, just follow the
steps here:
https://ci.apache.org/projects/flink/flink-docs-master/se
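For illustration only (the exact extension path is an
assumption; follow the linked guide), deploying with bdutil
looks roughly like:
$ ./bdutil -e extensions/flink/flink_env.sh deploy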
24. 24
1.6 Cloud
1.6.2 Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is a web
service providing a managed Hadoop framework.
• http://aws.amazon.com/elasticmapreduce/
• http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Example: Use Stratosphere with Amazon Elastic
MapReduce, February 18, 2014 by Robert Metzger
https://flink.apache.org/news/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html
Use pre-defined cluster definition to deploy Apache
Flink using Karamel web app http://www.karamel.io/
Getting Started – Installing Apache Flink on Amazon
EC2 by Kamel Hakimzadeh. Published on October 12,
2015 https://www.youtube.com/watch?v=tCIA8_2dR14
25. 25
Agenda
1. How to setup and configure your Apache
Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache
Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or
Eclipse) for Apache Flink?
5. How to write your Apache Flink program
in an IDE?
26. 26
2. How to use Apache Flink tools?
2.1 Command-Line Interface (CLI)
2.2 Web Submission Client
2.3 Job Manager Web Interface
2.4 Interactive Scala Shell
2.5 Apache Zeppelin Notebook
27. 27
2.1 Command-Line Interface (CLI)
Flink provides a CLI to run programs that are
packaged as JAR files, and control their execution.
bin/flink has four major actions:
• run     # runs a program
• info    # displays information about a program
• list    # lists scheduled and running jobs
• cancel  # cancels a running job
Example: ./bin/flink info ./examples/KMeans.jar
See CLI usage and related examples:
https://ci.apache.org/projects/flink/flink-docs-master/apis/cli.html
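For illustration, typical invocations look like this
(<jobID> is a placeholder for the ID printed by run or list):
$ ./bin/flink run ./examples/WordCount.jar
$ ./bin/flink list -r -s     # -r: running jobs, -s: scheduled jobs
$ ./bin/flink cancel <jobID>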
29. 29
2.2 Web Submission Client
Flink provides a web interface to:
• Upload programs
• Execute programs
• Inspect their execution plans
• Showcase programs
• Debug execution plans
• Demonstrate the system as a whole
The web interface runs on port 8080 by default.
To specify a custom port, set the webclient.port
property in the ./conf/flink-conf.yaml configuration file.
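For example (a one-line sketch in conf/flink-conf.yaml;
8090 is an arbitrary choice):
webclient.port: 8090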
30. 30
2.2 Web Submission Client
Start the web interface by executing:
./bin/start-webclient.sh
Stop the web interface by executing:
./bin/stop-webclient.sh
• Jobs are submitted to the JobManager specified
by jobmanager.rpc.address and jobmanager.rpc.port
• For more details and further configuration
options, please consult this webpage:
https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#webclient
31. 31
2.3 Job Manager Web Interface
The JobManager (the master of the
distributed system) starts a web interface
to observe program execution.
It runs on port 8081 by default (configured
in conf/flink-conf.yaml).
Open the JobManager’s web interface at
http://localhost:8081
Related default settings:
• jobmanager.rpc.port: 6123
• jobmanager.web.port: 8081
32. 32
2.3 Job Manager Web Interface
The web interface shows:
• Overall system status
• Job execution details
• Task Manager resource utilization
49. 49
Run K-Means example
1. Generate Input Data
Flink contains a data generator for K-Means that has the
following arguments (arguments in [] are optional):
-points <num> -k <num clusters> [-output <output-path>]
[-stddev <relative stddev>] [-range <centroid range>] [-seed <seed>]
Go to the Flink root installation:
$ cd flink-0.10.0
Create a new directory that will contain the data:
$ mkdir kmeans
$ cd kmeans
50. 50
Run K-Means example
Create some data using Flink's tool:
java -cp ../examples/KMeans.jar:../lib/flink-dist-0.10.0.jar \
  org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
  -points 500 -k 10 -stddev 0.08 -output `pwd`
The directory should now contain the files "centers"
and "points".
51. 51
Run K-Means example
Continue following the instructions on Quick
Start: Run K-Means Example as outlined here:
https://ci.apache.org/projects/flink/flink-docs-releas
Happy Flinking!
52. 52
Agenda
1. How to setup and configure your
Apache Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache
Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or
Eclipse) for Apache Flink?
5. How to write your Apache Flink program
in an IDE?
53. 53
4. How to set up your IDE (IntelliJ IDEA or
Eclipse) for Apache Flink?
4.1 How to set up your IDE (IntelliJ IDEA)?
4.2 How to setup your IDE (Eclipse)?
Flink uses mixed Scala/Java projects, which
pose a challenge to some IDEs
Minimal requirements for an IDE are:
• Support for Java and Scala (also mixed projects)
• Support for Maven with Java and Scala
54. 54
4.1 How to set up your IDE (IntelliJ IDEA)?
IntelliJ IDEA supports Maven out of the box
and offers a plugin for Scala development.
IntelliJ IDEA Download:
https://www.jetbrains.com/idea/download/
IntelliJ Scala Plugin
http://plugins.jetbrains.com/plugin/?id=1347
Check out Setting up IntelliJ IDEA guide for
details
https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#intel
Screencast: Run Apache Flink WordCount
from IntelliJ:
https://www.youtube.com/watch?v=JIV_rX-OIQM
55. 55
4.2 How to setup your IDE (Eclipse)?
• For Eclipse users, Apache Flink committers
recommend using Scala IDE 3.0.3, based on
Eclipse Kepler.
• While this is a slightly older version, they
found it to be the version that works most
robustly for a complex project like Flink. One
restriction, though, is that it works only with
Java 7, not with Java 8.
• Check out the Eclipse setup docs:
https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#eclipse
56. 56
Agenda
1. How to setup and configure your
Apache Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache
Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or
Eclipse) for Apache Flink?
5. How to write your Apache Flink program
in an IDE?
57. 57
5. How to write your Apache
Flink program in an IDE?
5.1 How to write a Flink program in an IDE?
5.2 How to generate a Flink project with
Maven?
5.3 How to import the Flink Maven project
into IDE
5.4 How to use logging?
5.5 FAQs and best practices related to coding
58. 58
5.1 How to write a Flink program in an
IDE?
The easiest way to get a working setup to
develop (and locally execute) Flink programs
is to follow the Quick Start guide:
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/java_api_quickstart.html
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/scala_api_quickstart.html
It uses a Maven archetype to configure and
generate a Flink Maven project.
This will save you time dealing with transitive
dependencies!
This Maven project can be imported into your
IDE.
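For a taste of what such a program looks like, here is a
minimal batch WordCount sketch against the DataSet API of that
era (the class name and inline input are illustrative):
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.fromElements("to be", "or not to be"); // illustrative input
    text.flatMap(new Tokenizer())
        .groupBy(0)   // group by the word (tuple field 0)
        .sum(1)       // sum the counts (tuple field 1)
        .print();
  }
  // Splits each line into (word, 1) tuples
  public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.toLowerCase().split("\\W+")) {
        if (!word.isEmpty()) out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}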
59. 59
5.2. How to generate a skeleton Flink
project with Maven?
Generate a skeleton flink-quickstart-java Maven project
to get started, with no need to manually download any
.tgz or .jar files.
Option 1:
$ curl http://flink.apache.org/q/quickstart.sh | bash
A sample quickstart Flink Job will be created:
• Switch into the directory using: cd quickstart
• Import the project there using your favorite IDE
(import it as a Maven project)
• Build a jar inside the directory using: mvn clean
package
• You will find the runnable jar in quickstart/target
60. 60
5.2. How to generate a skeleton Flink
project with Maven?
Option 2: Type the command below to create a
flink-quickstart-java or flink-quickstart-scala project and
specify the Flink version:
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=0.10.0
(You can also put flink-quickstart-scala as the artifact ID,
and a snapshot such as 0.1.0-SNAPSHOT as the version.)
61. 61
5.2. How to generate a skeleton Flink
project with Maven?
The generated projects are located in a folder
called flink-java-project or flink-scala-project.
In order to test the generated projects and to download
all required dependencies run the following commands
(change flink-java-project to flink-scala-project for
Scala projects)
• cd flink-java-project
• mvn clean package
Maven will now start to download all required
dependencies and build the Flink quickstart project.
62. 62
5.3 How to import the Flink Maven project
into IDE
The generated Maven project needs to be imported into
your IDE:
IntelliJ:
• Select “File” -> “Import Project”
• Select root folder of your project
• Select “Import project from external model”,
select “Maven”
• Leave default options and finish the import
Eclipse:
• Select “File” -> “Import” -> “Maven” -> “Existing Maven
Project”
• Follow the import instructions
63. 63
5.4 How to use logging?
Logging in Flink is implemented using the slf4j
logging interface, with log4j as the underlying logging
framework.
Log4j is controlled using a property file, usually
called log4j.properties. You can pass the filename and
location of this file to the JVM using
the -Dlog4j.configuration= parameter.
Loggers are created through the slf4j factory:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
Logger LOG = LoggerFactory.getLogger(Foobar.class);
You can also use logback instead of log4j.
https://ci.apache.org/projects/flink/flink-docs-release-0.9/internals/logging.html
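For illustration, a minimal sketch of declaring and using such
a logger inside a user function (class names are illustrative):
import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyMapper implements MapFunction<String, String> {
  // One static logger per class, created through the slf4j factory
  private static final Logger LOG = LoggerFactory.getLogger(MyMapper.class);

  @Override
  public String map(String value) {
    LOG.info("Processing element: {}", value); // {} placeholders avoid string concatenation
    return value.toUpperCase();
  }
}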
64. 64
5.5 FAQs & best practices related to
coding
Errors http://flink.apache.org/faq.html#errors
Usage http://flink.apache.org/faq.html#usage
Flink APIs Best Practices
https://ci.apache.org/projects/flink/flink-docs-master/apis/best_practices.html
Thanks!
Editor's Notes
The following steps assume a UNIX-like environment. For Windows, see Flink on Windows: https://ci.apache.org/projects/flink/flink-docs-master/setup/local_setup.html#flink-on-windows
This is Slide 5 of http://www.slideshare.net/robertmetzger1/apache-flink-hands-on
We pass the filename and location of this file using the -Dlog4j.configuration= parameter to the JVM.