This document summarizes the agenda for a Zeppelin Meetup. The agenda includes demos of real-time streaming and of running Zeppelin on Kubernetes, followed by a discussion of the Zeppelin roadmap and a Q&A. The roadmap focuses on modernizing the front end, adding collaboration features, running applications alongside notebooks on Kubernetes, and improving visualization capabilities.
7. Benefits
MULTI-TENANCY: Each note and/or user has its own container for interpreters
SCALABILITY: A single host no longer runs all interpreters
SECURITY: Each container is isolated (filesystem, process, etc.)
8. Usage
Available in 0.9.0-SNAPSHOT
http://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
Run
$ kubectl apply -f ${ZEPPELIN_HOME}/k8s/zeppelin-server.yaml
* Until 0.9.0 is released, you need to build your own Zeppelin and Spark Docker images:
1. Build Zeppelin distribution package: mvn package -Pbuild-distr …
2. Build Zeppelin docker image: cd scripts/docker/zeppelin/bin; docker build -t …
3. Build Spark docker image: <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build
9. Zeppelin Roadmap
- Zeppelin on Kubernetes
  - Apply a network policy to isolate the Interpreter Pod
  - Schedule a note in the background as a Job in Kubernetes
  - Run extra applications, such as a terminal or TensorBoard, the same way the Spark UI works
- Modernize front-end stack
  - Currently AngularJS
  - Dark theme?
- Visualization
  - Realtime data visualization
  - Pivot on the backend instead of in the front end, which requires transferring all data to the front end
- Sidebar
  - Sidebar with widgets, such as a ToC (table of contents), list of data, etc.
  - Online widget registry (Helium)
- Collaboration
  - Multi-cursor edit
  - Comment!
10. Zeppelin Roadmap
Modernize front-end stack
• Currently AngularJS
• Dark theme
Zeppelin on Kubernetes
• Apply a network policy to isolate the Interpreter Pod
• Schedule a note in the background as a Job in Kubernetes
• Run extra applications, such as a terminal or TensorBoard, the same way the Spark UI works
Collaboration
• Multi-cursor edit
• Comment!
Sidebar
• Sidebar with widgets, such as a ToC (table of contents), list of data, etc.
• Online widget registry (Helium)
Visualization
• Realtime data visualization
• Pivot on the backend instead of in the front end, which requires transferring all data to the front end
16. Problem
- The entire result dataset needs to be transferred to the browser, even though not all of it is rendered.
- Browser CPU and memory limit how much data can be transformed and rendered.
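The problem above is what motivates the roadmap item of pivoting on the backend. The following is a minimal Python sketch of that idea, not Zeppelin's implementation: the `pivot_on_backend` helper and the sample rows are hypothetical. The server aggregates the full result set and only the small pivoted table would need to be sent to the browser.

```python
from collections import defaultdict


def pivot_on_backend(rows, key, group, value):
    """Sum `value` per (key, group) pair.

    Hypothetical helper for illustration; not a Zeppelin API.
    """
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[key]][row[group]] += row[value]
    # Convert to plain dicts so the result is JSON-serializable.
    return {k: dict(groups) for k, groups in table.items()}


# Made-up result set standing in for a large interpreter output.
rows = [
    {"job": "admin", "marital": "single", "balance": 100.0},
    {"job": "admin", "marital": "married", "balance": 250.0},
    {"job": "admin", "marital": "single", "balance": 50.0},
    {"job": "tech", "marital": "married", "balance": 75.0},
]

# Only this aggregated structure would be transferred to the front end.
pivoted = pivot_on_backend(rows, key="job", group="marital", value="balance")
```

With this approach the payload size depends on the number of distinct key and group values rather than on the number of rows in the result.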
20. Related work
- Streaming data updates (without refreshing the notebook)
- Separate transfer of the result dataset and the note to the browser
- Partial data fetch for table display
- Extending the TableData API
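Partial data fetch for table display can be pictured as offset/limit paging over a result held on the server. The sketch below is an assumption-laden illustration: `fetch_page` and its response shape are hypothetical and do not reflect the TableData API.

```python
def fetch_page(rows, offset, limit):
    """Return one page of rows plus the total row count.

    Hypothetical helper for illustration; not part of Zeppelin.
    """
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    page = rows[offset:offset + limit]
    return {"total": len(rows), "offset": offset, "rows": page}


# Made-up result set kept on the server side.
result = [{"id": i} for i in range(1000)]

# The table display asks only for the rows it is about to render.
first = fetch_page(result, offset=0, limit=50)
last = fetch_page(result, offset=980, limit=50)
```

Returning the total alongside each page lets a table widget draw its scrollbar or pager without holding the full dataset in the browser.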
24. TOC (table of contents) widget
Contents
1. This is notebook
   a. First
   b. Second
2. Next
   a. Next
One of the most popular features in Jupyter.
Google Colab also supports it.
Zeppelin has a SPELL for it.
See https://www.npmjs.com/package/zeppelin-toc-spell
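One way such a widget could derive its outline is by scanning the Markdown headings of each paragraph. This is a rough Python sketch under that assumption; `build_toc` and `render_toc` are hypothetical names, and it is not how zeppelin-toc-spell is implemented.

```python
import re

# Markdown ATX heading: 1 to 6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_toc(paragraphs):
    """Collect (level, title, paragraph_index) for every Markdown heading."""
    toc = []
    for index, text in enumerate(paragraphs):
        for line in text.splitlines():
            match = HEADING.match(line)
            if match:
                toc.append((len(match.group(1)), match.group(2), index))
    return toc


def render_toc(toc):
    """Indent each entry by two spaces per heading level below the first."""
    return "\n".join("  " * (level - 1) + title for level, title, _ in toc)


# Made-up paragraph texts mirroring the outline shown on the slide.
paragraphs = [
    "# This is notebook\n## First\nsome text",
    "## Second",
    "# Next\n## Next",
]
toc = build_toc(paragraphs)
outline = render_toc(toc)
```

Keeping the paragraph index with each entry is what would let a sidebar entry scroll to the paragraph that contains the heading.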
25. Table data widget
Displays the list of tables, the schema of a table, and a preview of the data recognized by the interpreter.
Tables
Name     Temporary
table1   no
bank     yes
Schema
Column   Type
age      INT
job      TEXT
Preview
26. Clipboard
Drag and drop a paragraph onto the clipboard.
Then, in the same or another notebook, drag and drop the paragraph from the clipboard.
Drop paragraph here
Paragraph a
Paragraph b