Introduction to Spark (Intern Event Presentation)

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.

Software

Introduction to Spark
Matei Zaharia
Databricks Intern Event, August 2015

What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

About Databricks
Founded by creators of Spark in 2013
Oﬀers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

0
20
40
60
80
100
120
140
160
2010 2011 2012 2013 2014 2015
Contributors
Contributors / Month to Spark
Community Growth
Most active open source project in
big data

Spark Programming Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure

Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines
=
spark.textFile(“hdfs://...”)

errors
=
lines.filter(lambda
s:
s.startswith(“ERROR”))

messages
=
errors.map(lambda
s:
s.split(‘t’)[2])

messages.cache()

Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
messages.filter(lambda
s:
“MySQL”
in
s).count()

messages.filter(lambda
s:
“Redis”
in
s).count()

.
.
.

tasks
results
Cache 1
Cache 2
Cache 3
Base RDD
Transformed RDD
Action
Result: full-text search of Wikipedia in
0.5 sec (vs 20s for on-disk data)

Example: Logistic Regression
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 s / iteration
first iteration 80 s
further iterations 1 s
Iterative algorithm used in machine learning

Source: Daytona GraySort benchmark, sortbenchmark.org
2100 machines2013 Record:
Hadoop
72 minutes
2014 Record:
Spark
207 machines
23 minutes
On-Disk Performance
Time to sort 100TB

Higher-Level Libraries
Spark
Spark
Streaming
real-time
Spark SQL
structured data
MLlib
machine
learning
GraphX
graph

Higher-Level Libraries
//
Load
data
using
SQL

points
=
ctx.sql(“select
latitude,
longitude
from
tweets”)

//
Train
a
machine
learning
model

model
=
KMeans.train(points,
10)

//
Apply
it
to
a
stream

sc.twitterStream(...)

.map(lambda
t:
(model.predict(t.location),
1))

.reduceByWindow(“5s”,
lambda
a,
b:
a
+
b)

Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org
Spark Community

Ongoing Work
Speeding up Spark through code generation and
binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)

This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark. Below topics are explained in this Spark presentation: 1. History of Spark 2. What is Spark 3. Hadoop vs Spark 4. Components of Apache Spark 5. Spark architecture 6. Applications of Spark 7. Spark usecase What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab. What are the course objectives? Simplilearn’s Apache Spark and Scala certification training are designed to: 1. Advance your expertise in the Big Data Hadoop Ecosystem 2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark 3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos What skills will you learn? By completing this Apache Spark and Scala course you will be able to: 1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations 2. Understand the fundamentals of the Scala programming language and its features 3. Explain and master the process of installing Spark as a standalone cluster 4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark 5. Master Structured Query Language (SQL) using SparkSQL 6. Gain a thorough understanding of Spark streaming features 7. Master and describe the features of Spark ML programming and GraphX programming Who should take this Scala course? 1. Professionals aspiring for a career in the field of real-time big data analytics 2. Analytics professionals 3. Research professionals 4. IT developers and testers 5. Data scientists 6. BI and reporting professionals 7. Students who wish to gain a thorough understanding of Apache Spark Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Apache Spark overview

DataArt

Spark

Koushik Mondal

Apache Spark Introduction

sudhakara st

Introduction to Apache Spark

datamantra

Apache Spark Architecture

Alexey Grishchenko

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

Spark architecture

GauravBiswas9

Simplifying Big Data Analytics with Apache SparkDatabricks

Scaling your Data Pipelines with Apache Spark on Kubernetes

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...

This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial: 1) Big Data Analytics 2) What is Apache Spark? 3) Why Apache Spark? 4) Using Spark with Hadoop 5) Apache Spark Features 6) Apache Spark Architecture 7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX 8) Demo: Analyze Flight Data Using Apache Spark

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Introduction to spark

Home

Apache Spark PDF

Naresh Rupareliya

Hadoop MapReduce Fundamentals

Lynn Langit

Programming in Spark using PySpark

Mostafa

This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.

Big Data Processing with Spark and Scala

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

Using pySpark with Google Colab & Spark 3.0 preview

Mario Cartia

Apache Spark is the Big Data opensource framework used by the world's leading companies for the implementation of advanced analytics. The talk will introduce the architecture, the modules and the main functionalities of the framework showing some practical examples in Python that do not require the installation of any software on your machine using the Colab tool made available by Google. The talk will also introduce the new features available on the preview version of the upcoming release 3.0

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.

DevOps for Databricks

Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents. We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Apache spark

shima jafari

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Spark Under the Hood - Meetup @ Data Science London

What's hot

Spark SQL

Joud Khattab

Apache Spark Fundamentals

Zahra Eskandari

Memory Management in Apache Spark

Spark architecture

GauravBiswas9

Simplifying Big Data Analytics with Apache SparkDatabricks

Scaling your Data Pipelines with Apache Spark on Kubernetes

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Introduction to spark

Home

Apache Spark PDF

Naresh Rupareliya

Hadoop MapReduce Fundamentals

Lynn Langit

Programming in Spark using PySpark

Mostafa

Big Data Processing with Spark and Scala

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

Using pySpark with Google Colab & Spark 3.0 preview

Mario Cartia

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

DevOps for Databricks

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Apache spark

shima jafari

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

What's hot (20)

Spark SQL

Apache Spark Fundamentals

Memory Management in Apache Spark

Spark architecture

Simplifying Big Data Analytics with Apache Spark

Scaling your Data Pipelines with Apache Spark on Kubernetes

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Introduction to spark

Apache Spark PDF

Hadoop MapReduce Fundamentals

Programming in Spark using PySpark

Big Data Processing with Spark and Scala

Apache Spark in Depth: Core Concepts, Architecture & Internals

Using pySpark with Google Colab & Spark 3.0 preview

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

DevOps for Databricks

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Apache spark

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Viewers also liked

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Spark Under the Hood - Meetup @ Data Science London

2016 Spark Summit East Keynote: Matei Zaharia

Apache Spark 2.0: Faster, Easier, and Smarter

In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release. The major themes for Spark 2.0 are: - Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs - Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries. - Tungsten Phase 2: Speed up Apache Spark by 10X

Parallelizing Existing R Packages with SparkR

R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR. Speaker: Hossein Falaki This talk was originally presented at Spark Summit East 2017.

Internship presentationsamcrosier

Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks

Data Con LA

Spark and the Berkeley Data Analytics Stack (BDAS) represent a unified, distributed, and parallel high-performance big data processing and analytics platform. Written in Scala, Spark supports multiple languages including Python, Java, Scala, and even R. Commonly seen as the successor to Hadoop, Spark is fully compatible with Hadoop including UDFs, SerDe’s, file formats, and compression algorithms. The high-level Spark libraries include stream processing, machine learning, graph processing, approximating, sampling - and every combination therein. The most active big data open source project in existence, Spark boasts ~500 of contributors and 10,000 commits to date. Spark recently broke the Daytona GraySort 100 TB record with almost 3 times the throughput, 1/3rd less time, and 1/10th of the resources!

Spark - The beginnings

Daniel Leon

Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...

Chris Fregly

Apache Spark

Mahdi Esmailoghli

Introduction to Apache Spark

Anastasios Skarlatidis

Intro to Apache Spark

Cloudera, Inc.

Apache spark linkedin

Yukti Kaura

New directions for Apache Spark in 2015

The Evolution of Data Analysis with Hadoop - StampedeCon 2014

StampedeCon

At StampedeCon 2014, Tom Wheeler (Cloudera) presented, "The Evolution of Data Analysis with Hadoop." This session will lead the audience through the evolution of data analysis in Hadoop to illustrate its progression from the original low-level, batch-oriented MapReduce approach to today’s higher-level interactive tools that require very little technical knowledge. We’ll discuss Apache Crunch, Hive, Impala and Solr. While the nature of this talk is somewhat technical, no prior knowledge of Hadoop or any specific programming language is required. Frequent live demonstrations of the tools discussed will emphasize that analyzing data in Hadoop can be as easy as using a relational database or Internet search engine.

Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...

Chris Fregly

Extreme-scale Ad-Tech using Spark and Databricks at MediaMath

Spark Summit

Enabling Exploratory Analysis of Large Data with Apache Spark and R

R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which was inspired by R data.frames, to R. With Spark DataFrames, and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte size data sets. In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.

Apache spark - Spark's distributed programming model

Martin Zapletal

Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Chris Fregly

Viewers also liked (20)

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Spark Under the Hood - Meetup @ Data Science London

2016 Spark Summit East Keynote: Matei Zaharia

Apache Spark 2.0: Faster, Easier, and Smarter

Parallelizing Existing R Packages with SparkR

Internship presentation

Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks

Spark - The beginnings

Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...

Apache Spark

Introduction to Apache Spark

Intro to Apache Spark

Apache spark linkedin

New directions for Apache Spark in 2015

The Evolution of Data Analysis with Hadoop - StampedeCon 2014

Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...

Extreme-scale Ad-Tech using Spark and Databricks at MediaMath

Enabling Exploratory Analysis of Large Data with Apache Spark and R

Apache spark - Spark's distributed programming model

Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Similar to Introduction to Spark (Intern Event Presentation)

Unified Big Data Processing with Apache Spark

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF. Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com. Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.

Unified Big Data Processing with Apache Spark (QCON 2014)

Vienna Data Science Group

While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them. Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.

End-to-end Data Pipeline with Apache SparkDatabricks

20170126 big data processing

Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed: • What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them? • When to use batch and when stream processing? • What is a Lambda-Architecture and a Kappa Architecture? • What are the best practices for your project?

Artigo 81 - spark_tutorial.pdf

WalmirCouto3

Apache Spark Overview @ ferret

Andrii Gakhov

Paris Data Geek - Spark Streaming

Djamel Zouaoui

Jump Start with Apache Spark 2.0 on Databricks

Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: What’s new in Spark 2.0 SparkSessions vs SparkContexts Datasets/Dataframes and Spark SQL Introduction to Structured Streaming concepts and APIs

20130912 YTC_Reynold Xin_Spark and Shark

YahooTechConference

In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperform Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications. These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

DataStax Academy

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

An introduction To Apache Spark

Amir Sedighi

Spark streaming , Spark SQL

Yousun Jeong

Apache Spark RDDs

Dean Chen

Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next generation engine for large scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's function idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases(Spark SQL, Spark Streaming, MLib and GraphX). The Spark source will be be referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/

Apache Spark & Hadoop

MapR Technologies

http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now. That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Sparkstreaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop. Keys Botzum - Senior Principal Technologist with MapR Technologies Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.

Apache spark-melbourne-april-2015-meetup

Ned Shawa

In Memory Analytics with Apache Spark

Venkata Naga Ravi

Spark Study Notes

Richard Kuo

Azure Databricks is Easier Than You Think

Ike Ellis

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

IT Event

In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general. This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...

Аліна Шепшелей

Similar to Introduction to Spark (Intern Event Presentation) (20)

Unified Big Data Processing with Apache Spark

Unified Big Data Processing with Apache Spark (QCON 2014)

End-to-end Data Pipeline with Apache Spark

20170126 big data processing

Artigo 81 - spark_tutorial.pdf

Apache Spark Overview @ ferret

Paris Data Geek - Spark Streaming

Jump Start with Apache Spark 2.0 on Databricks

20130912 YTC_Reynold Xin_Spark and Shark

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

An introduction To Apache Spark

Spark streaming , Spark SQL

Apache Spark RDDs

Apache Spark & Hadoop

Apache spark-melbourne-april-2015-meetup

In Memory Analytics with Apache Spark

Spark Study Notes

Azure Databricks is Easier Than You Think

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...

More from Databricks

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized Platform

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data Science

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Why APM Is Not the Same As ML Monitoring

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Sawtooth Windows for Feature Aggregations

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Re-imagine Data Monitoring with whylogs and Spark

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

Machine Learning CI/CD for Email Attack Detection

Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models. In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.

Jeeves Grows Up: An AI Chatbot for Performance and Quality

Sarah: CEO-Finance-Report pipeline seems to be slow today. Why Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days. Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours. Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert. We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example: Tom: I am not seeing any data for today in my Campaign Metrics Dashboard. Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021. This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.

More from Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Machine Learning CI/CD for Email Attack Detection

Jeeves Grows Up: An AI Chatbot for Performance and Quality

Recently uploaded

Accelerate Enterprise Software Engineering with Platformless

WSO2

Key takeaways: Challenges of building platforms and the benefits of platformless. Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience. How Choreo enables the platformless experience. How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo. Demo of an end-to-end app built and deployed on Choreo.

Advanced Flow Concepts Every Developer Should Know

Peter Caitens

Why React Native as a Strategic Advantage for Startup Innovation.pdf

ayushiqss

Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework. In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing making it a valuable skill. But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities among other benefits. This way, with React Native, developers can write code once and run it on both iOS and Android devices thus saving time and resources leading to shorter development cycles hence faster time-to-market for your app. Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.

top nidhi software solution freedownload

vrstrong314

This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.

Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...

The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.

Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...

Hivelance Technology

Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots

Understanding Globus Data Transfers with NetSage

Tendenci - The Open Source AMS (Association Management Software)

NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?

Corporate Management | Session 3 of 3 | Tendenci AMS

Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have. For more Tendenci AMS events, check out www.tendenci.com/events

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

takuyayamamoto1800

Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...

Shahin Sheidaei

Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.

Cyaniclab : Software Development Agency Portfolio.pdf

Cyanic lab

CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.

Large Language Models and the End of Programming

Matt Welsh

GlobusWorld 2024 Opening Keynote session

De mooiste recreatieve routes ontdekken met RouteYou en FME

Jelle | Nordend

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...

Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.

First Steps with Globus Compute Multi-User Endpoints