Slides from my talk at 2016 Apache: Big Data conference.
Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems to retain fine-grained control over the cloud resources and address common problems of fault-tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide the developers through a simple REEF application and discuss current state of Apache REEF project and its place in the Hadoop ecosystem.
Topic Modeling via Tensor Factorization - Use Case for Apache REEFSergiy Matusevych
Slides from my talk at 2015 Hadoop Summit.
Topic modeling – a task of discovering hidden thematic structure in a set of documents – is an important problem of modern machine learning. Despite great progress in recent years, scaling it for the large number of topics and massive corpora remains a challenge. We share our experiences in building a high performance distributed system for topic modeling using Apache REEF framework. We start with introduction into topic modeling via Latent Dirichlet Allocation, and describe our new LDA inference algorithm based on tensor factorization. Finally, we demonstrate how using Apache REEF framework helped us to implement the system in a few lines of clean readable code.
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, ?how can we make the data available to our analytical systems faster?? Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still be able to spill to disk if required. Spark’s powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Topic Modeling via Tensor Factorization - Use Case for Apache REEFSergiy Matusevych
Slides from my talk at 2015 Hadoop Summit.
Topic modeling – a task of discovering hidden thematic structure in a set of documents – is an important problem of modern machine learning. Despite great progress in recent years, scaling it for the large number of topics and massive corpora remains a challenge. We share our experiences in building a high performance distributed system for topic modeling using Apache REEF framework. We start with introduction into topic modeling via Latent Dirichlet Allocation, and describe our new LDA inference algorithm based on tensor factorization. Finally, we demonstrate how using Apache REEF framework helped us to implement the system in a few lines of clean readable code.
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, ?how can we make the data available to our analytical systems faster?? Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still be able to spill to disk if required. Spark’s powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Reactive app using actor model & apache sparkRahul Kumar
Developing Application with Big Data is really challenging work, scaling, fault tolerance and responsiveness some are the biggest challenge. Realtime bigdata application that have self healing feature is a dream these days. Apache Spark is a fast in-memory data processing system that gives a good backend for realtime application.In this talk I will show how to use reactive platform, Actor model and Apache Spark stack to develop a system that have responsiveness, resiliency, fault tolerance and message driven feature.
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0. YARN supports running arbitrary processing paradigms on the same Hadoop cluster. This allows for development of newer frameworks as well as more efficient implementations of existing frameworks that can all run on and share the resources of a single multi-tenant YARN cluster. This talk gives a brief introduction to YARN. We will illustrate how to create applications and how to best make use of YARN. We will show examples of different applications such as Apache Tez and Apache Samza that can leverage YARN and present best practices/guidelines on building applications on top of Apache Hadoop YARN.
Emerging technologies /frameworks in Big DataRahul Jain
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
NOTE: This was converted to Powerpoint from Keynote. Slideshare does not play the embedded videos. You can download the powerpoint from slideshare and import it into keynote. The videos should work in the keynote.
Abstract:
In this presentation, we will describe the "Spark Kernel" which enables applications, such as end-user facing and interactive applications, to interface with Spark clusters. It provides a gateway to define and run Spark tasks and to collect results from a cluster without the friction associated with shipping jars and reading results from peripheral systems. Using the Spark Kernel as a proxy, applications can be hosted remotely from Spark.
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Reactive app using actor model & apache sparkRahul Kumar
Developing Application with Big Data is really challenging work, scaling, fault tolerance and responsiveness some are the biggest challenge. Realtime bigdata application that have self healing feature is a dream these days. Apache Spark is a fast in-memory data processing system that gives a good backend for realtime application.In this talk I will show how to use reactive platform, Actor model and Apache Spark stack to develop a system that have responsiveness, resiliency, fault tolerance and message driven feature.
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0. YARN supports running arbitrary processing paradigms on the same Hadoop cluster. This allows for development of newer frameworks as well as more efficient implementations of existing frameworks that can all run on and share the resources of a single multi-tenant YARN cluster. This talk gives a brief introduction to YARN. We will illustrate how to create applications and how to best make use of YARN. We will show examples of different applications such as Apache Tez and Apache Samza that can leverage YARN and present best practices/guidelines on building applications on top of Apache Hadoop YARN.
Emerging technologies /frameworks in Big DataRahul Jain
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
NOTE: This was converted to Powerpoint from Keynote. Slideshare does not play the embedded videos. You can download the powerpoint from slideshare and import it into keynote. The videos should work in the keynote.
Abstract:
In this presentation, we will describe the "Spark Kernel" which enables applications, such as end-user facing and interactive applications, to interface with Spark clusters. It provides a gateway to define and run Spark tasks and to collect results from a cluster without the friction associated with shipping jars and reading results from peripheral systems. Using the Spark Kernel as a proxy, applications can be hosted remotely from Spark.
High availability of data across geographic regions for search and analytical applications is a challenging task. Mission critical applications need effective failover strategies across data centers. Apache Solr offers Cross Data Center Replication (CDCR) as a feature from 6.0 and has added more features in subsequent releases.
The first part of session will center on an active-passive design model with one data-center as the primary and other data-centers as secondary clusters. The second design model centers on designing an active-active bidirectional setup such that both querying and indexing traffic can gracefully be redirected to the failover cluster.
The third part of session will center on an actual use case: An analytics application with high availability. We will discuss the improvements observed in terms of maintenance, performance, and throughput.
The session concludes with challenges and/or limitations in the current design and what improvements are forthcoming for Cross Data Center Replication in Apache Solr.
Deploy and Destroy: Testing Environments - Michael Arenzon - DevOpsDays Tel A...DevOpsDays Tel Aviv
One of the critical factors for development velocity is software correctness. Our ability to develop and ship new features fast is bounded by our ability to validate several aspects of the change: * Does the feature meet the requirements? * How does the feature affect existing code, and how can it affect the production environment? With continues codebase growth and new features being added, naturally our productivity decreases, and our need to improve the guarantees for quality and correctness increase.
In this talk, I’ll focus on testing environments: why developers need a self-serve platform to create a full functioning environment on-demand, how such environments should be managed, and how can one restore part of the lost velocity. I’ll cover an internal system we use at AppsFlyer called ‘Namespaces’ that addresses the issue with the help of Mesos / Marathon, Docker, Traefik, and Consul.
Through the magic of virtualization technology (Vagrant) and Puppet, a companion Enterprise grade provisioning technology, we explore how to make the complex configuration game a walk in the park. Bring new team members up to speed in minutes, eliminate variances in configurations, and make integration issues a thing of the past.
Welcome to the new age of team development!
Everything you wanted to know about writing async, concurrent http apps in java Baruch Sadogursky
As presented at CodeMotion Tel Aviv:
Facing tens of millions of clients continuously downloading binaries from its repositories, JFrog decided to offer an OSS client that natively supports these downloads. This session shares the main challenges of developing a highly concurrent, resumable, async download library on top of an Apache HTTP client. It also covers other libraries JFrog tested and why it decided to reinvent the wheel. Consider yourself forewarned: lots of HTTP internals, NIO, and concurrency ahead!
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"Daniel Bryant
It’s easy to get seduced by being able to quickly deploy and scale applications by using containers. However, when things inevitably go wrong, how do you debug your application? This session covers various pro bug hunting tips and tricks. It shows live demos of tools such as the Docker stats API, Docker exec (and top, vmstat, and netstat), and how to use the ELK stack for centralized logging. It also dives into other more sophisticated tools that operate at the application and (micro)service layer, such as Twitter’s Zipkin tracing app, Spring Boot’s Actuator, and DropWizard’s Metrics library. Keep those container-based nightmares away by ensuring that when the worst does happen, you have the tools, info, and experience to debug containerized applications.
Presented at JavaOne 2015 with Steve Poole
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...Flink Forward
Flink's streaming API can be used to construct a scalable, fault tolerant framework for buffering high frequency time series data, with the goal being to output larger, immutable blocks of data. As the data is being buffered into larger blocks, Flink's queryable state feature can be used to service requests for data still in the "buffering" state. The high frequency time series data set in this example is electro cardiogram data (EKG) that is buffered from a sample rate in millisecond into multi minute blocks.
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...Amazon Web Services
In this session, we introduce you to a solution for easily running a Docker-powered microservices architecture on AWS using Elastic Beanstalk. We will also cover the fundamentals of Elastic Beanstalk and how it benefits developers looking for a quick and scalable way to get their applications running on AWS with no infrastructure work required.
Building a microservices architecture using Docker can require a lot of work, from launching and operating the underlying infrastructure to installing and maintaining cluster management software. With AWS Elastic Beanstalk’s multicontainer support feature, many of these tasks are simplified and abstracted away so you can focus on your application code. AWS Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications and services developed with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker."
Learning Objectives:
• Learn the basics of AWS Elastic Beanstalk
• Understand how to use Elastic Beanstalk to run containerized applications
• Learn how to use Elastic Beanstalk to start architecting microservices-based applications
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
2. a library to simplify and unify the lower layers
of big data systems on modern resource managers.
3. • One cluster used by all workloads
(interactive, batch, streaming, …)
• Resources are handed out as
containers
Container is slice of a machine
Fixed RAM, CPU, I/O, …
• Examples:
Apache Hadoop YARN
Apache Mesos
Google Borg
Resource
Managers
14. Silos
Silos are hard to build
Each duplicates the same
mechanisms under the hood
In practice, silos form pipelines
• In each step:
Read from and write to HDFS
• Synchronize on complete data
between steps
Slow!
σ
π
⋈
15. Fun new problems,
boring old ones
Control flow setup
Master / Slave
Membership
Via heartbeats
Task Submission
Inter-Task Messaging
…
σ
π
⋈
16. REEF
Breadth
Mechanism over Policy
Avoid silos
Recognize the need
for different models
But: allow them to be composed
RM portability
Make app independent from low-
level Resource Manager APIs.
Bridge the JVM/CLR/… divide
Different parts of the computation
can be in either of them.
Resource Manager and DFS := Cluster OS
REEF := stdlib
21. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc.
+ 3
22. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
+ 3
23. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
24. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
25. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
26. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
σ σ
σ
27. REEF Control Flow
Yarn ( ) handles resource
management (security, quotas,
priorities)
Per-job Drivers ( ) request
resources, coordinate computations,
and handle events: faults,
preemption, etc…
REEF Evaluators ( ) hold hardware
resources, allowing multiple Tasks
( , , , , , , etc…) to
use the same cached state.
πσ
28.
29.
30. Hello World
1. Client submits the Driver
allocation request to the
Resource Manager ( )
$…
31. Hello World
1. Client submits the Driver
allocation request to the
Resource Manager ( )
2. Once started, Driver ( )
requests one Evaluator
container from YARN
$…
32. Hello World
1. Client submits the Driver
allocation request to the
Resource Manager ( )
2. Once started, Driver ( )
requests one Evaluator
container from YARN
3. Driver submits HelloWorld
Task to the newly allocated
Evaluator
$…
33. Hello World
1. Client submits the Driver
allocation request to the
Resource Manager ( )
2. Once started, Driver ( )
requests one Evaluator
container from YARN
3. Driver submits a HelloWorld
Task to the newly allocated
Evaluator
4. HelloWorld Task prints a
greeting to the log and quits
$…
LOG.log(Level.INFO, "Hello REEF!");
34. Hello World
1. Client submits the Driver
allocation request to the
Resource Manager ( )
2. Once started, Driver ( )
requests one Evaluator
container from YARN
3. Driver submits a HelloWorld
Task to the newly allocated
Evaluator
4. HelloWorld Task prints a
greeting to the log and quits
5. Driver receives CompletedTask
notification, releases the
Evaluator and quits $…
35. public final class HelloTask implements Task {
@Inject
private HelloTask() {}
@Override
public byte[] call(final byte[] memento) {
LOG.log(Level.INFO, "Hello REEF!");
return null;
}
}
36. public final class StartHandler implements EventHandler<StartTime> {
@Override
public void onNext(final StartTime startTime) {
requestor.submit(EvaluatorRequest.newBuilder()
.setNumber(1).setMemory(64).setNumberOfCores(1).build());
}
}
public final class EvaluatorAllocatedHandler
implements EventHandler<AllocatedEvaluator> {
@Override
public void onNext(final AllocatedEvaluator allocatedEvaluator) {
allocatedEvaluator.submitTask(TaskConfiguration.CONF
.set(TaskConfiguration.IDENTIFIER, "HelloREEFTask")
.set(TaskConfiguration.TASK, HelloTask.class)
.build();
}
}
39. Tang
Command = ‘ls’
Error:
Unknown parameter "Command"
Missing required parameter "cmd"
Configuration is hard
- Errors often show up at runtime only
- State of receiving process is
unknown to the configuring process
Our approach:
- Use Dependency Injection
- Configuration is pure data
Early static and dynamic checks
40. Tang
cmd = ‘ls’
ShellTask
Evaluator
Error:
Required instanceof Evaluator
Got ShellTask
Configuration is hard
- Errors often show up at runtime only
- State of receiving process is
unknown to the configuring process
Our approach:
- Use Dependency Injection
- Configuration is pure data
Early static and dynamic checks
41. Tang
cmd = ‘ls’
ShellTask
Task
YarnEvaluator
Evaluator
Configuration is hard
- Errors often show up at runtime only
- State of receiving process is
unknown to the configuring process
Our approach:
- Use Dependency Injection
- Configuration is pure data
Early static and dynamic checks
42. Wake:
Events + I/O
Event-based
programming and remoting
API: A static subset of Rx
- Static checking of event flows
- Aggressive JVM event inlining
Implementation: “SEDA++”
- Global thread pool
- Thread sharing where possible
43.
44. Distributed Shell
1. Client submits Driver configuration
to YARN runtime. Configuration
has cmd shell command, and n
number of Evaluators to run it on
$…
45. Distributed Shell
1. Client submits Driver configuration
to YARN runtime. Configuration
has cmd shell command, and n
number of Evaluators to run it on
2. Once started, Driver requests n
Evaluators from YARN
$…
46. Distributed Shell
1. Client submits Driver configuration
to YARN runtime. Configuration
has cmd shell command, and n
number of Evaluators to run it on
2. Once started, Driver requests n
Evaluators from YARN
3. Driver submits ShellTask with cmd
to each Evaluator
#!
$…
#!
#!
47. Distributed Shell
1. Client submits Driver configuration
to YARN runtime. Configuration
has cmd shell command, and n
number of Evaluators to run it on
2. Once started, Driver requests n
Evaluators from YARN
3. Driver submits ShellTask with cmd
to each Evaluator
4. Each ShellTask runs the command,
logs its stdout, and quits
#!
$…
#!
#!
48. Distributed Shell
1. Client submits Driver configuration
to YARN runtime. Configuration
has cmd shell command, and n
number of Evaluators to run it on
2. Once started, Driver requests n
Evaluators from YARN
3. Driver submits ShellTask with cmd
to each Evaluator
4. Each ShellTask runs the command,
logs its stdout, and quits
5. Driver receives CompletedTask
notifications, releases Evaluators
and quits when all Evaluators are
gone
$…
49. @Inject
private ShellTask(@Parameter(Command.class) final String command) {
this.command = command;
}
@Override
public byte[] call(final byte[] memento) {
final String result = CommandUtils.runCommand(this.command);
LOG.log(Level.INFO, result);
return CODEC.encode(result);
}
50. @NamedParameter(doc="Number of evaluators", short_name="n", default_value="1")
public final class NumEvaluators implements Name<Integer> {}
@NamedParameter(doc="The shell command", short_name="cmd")
public final class Command implements Name<String> {}
51. public final class EvaluatorAllocatedHandler
implements EventHandler<AllocatedEvaluator> {
@Override
public void onNext(final AllocatedEvaluator allocatedEvaluator) {
final JavaConfigurationBuilder taskConfigBuilder =
tang.newConfigurationBuilder(TaskConfiguration.CONF
.set(TaskConfiguration.IDENTIFIER, "ShellTask")
.set(TaskConfiguration.TASK, ShellTask.class)
.build());
taskConfigBuilder.bindNamedParameter(Command.class, command);
allocatedEvaluator.submitTask(taskConfigBuilder.build());
}
}
59. Broadcast
Mechanism
The sender (S) sends a value
All receivers (R) receive the identical value
Use case
Distribute a model
Distribute a descent direction for line search
Optimizations
Trees to distribute the data
Be mindful of the topology of machines
Do peer-to-peer sends
…
S
60. Reduce
Mechanism
The senders (S) each send a value
Those values are aggregated
The receiver (R) receives the aggregate total
Use case
Aggregate gradients, losses
Optimizations
Aggregation trees
Pipelining
61.
62.
63. // We can send and receive any Java serializable data, e.g. jBLAS matrices
private final Broadcast.Sender<DoubleMatrix> modelSender;
private final Broadcast.Receiver<DoubleMatrix[]> resultReceiver;
// Broadcast the model, collect the results, repeat.
do {
this.modelSender.send(modelSlice);
// ...
final DoubleMatrix[] result = this.resultReceiver.reduce();
} while (notConverged(modelSlice, prevModelSlice));
70. Start with a random 𝑤0
Until convergence:
Compute the gradient
𝜕 𝑤 =
𝑥,𝑦 𝜖 𝑋
𝑤, 𝑥 − 𝑦
Apply gradient and regularizer to the model
𝑤𝑡+1 = 𝑤𝑡 − 𝜂 𝜕 𝑤 + 𝜆
Data parallel in X
Reduce
Needed by Partitions
Broadcast
73. On REEF
Driver requests Evaluators
Driver sends Tasks to load & parse
data
Driver sends ComputeGradient and
master Tasks
74. On REEF
Driver requests Evaluators
Driver sends Tasks to load & parse
data
Driver sends ComputeGradient and
master Tasks
Computation commences in
sequence of Broadcast and Reduce
77. Start with a random 𝑤0
Until convergence:
Compute the gradient
𝜕 𝑤 =
𝑥,𝑦 𝜖 𝑋
2 𝑤, 𝑥 − 𝑦
Apply gradient to the model
𝑤𝑡+1 = 𝑤𝑡 − 𝜕 𝑤
Not having some machines means
training on a (random) subset of X