Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R


This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.


Combining the Strengths
of MLlib, scikit-learn, & R
Joseph K. Bradley
Spark Summit Europe
October 2015

scikit-learn & R
Great libraries
• Detailed documentation & how-to guides
• Many packages & extensions
Business investment
• Education
• Tooling & workflows

Big Data
• Scaling (trees)
• Topic model on 4.5 million Wikipedia articles
• Recommendation with 50 million users, 5 million songs, 50 billion ratings

Big Data & MLlib
• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems

Bridging the gap
How do you get from a single-machine workload
to a fully distributed one?
At school: Machine Learning with R on my laptop
The Goal: Machine Learning on a huge computing cluster

Wish list
• Run original code on a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
• Use familiar APIs

Our task
Sentiment analysis
Given a review (text),
predict the user’s rating.
Data from https://snap.stanford.edu/data/web-Amazon.html

Our ML workflow
Text: "This scarf I bought is very strange. When I ..." (Label: Rating = 3.0)
→ Tokenizer
→ Words: [This, scarf, I, bought, ...]
→ Hashing Term-Freq
→ Features: [2.0, 0.0, 3.0, ...]
→ Linear Regression
→ Prediction: Rating = 2.0
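The first two stages of this workflow (tokenizing a review, then hashing words into a fixed-length term-frequency vector) can be sketched in plain Python. This is a minimal illustration, not the code used in the talk: the regular expression, the CRC32 hash, and the vector length of 16 are all assumptions chosen for readability.

```python
import re
import zlib


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hashing_tf(tokens, num_features=16):
    """Hashing trick: each token increments the slot chosen by its hash.

    CRC32 is used only because it is deterministic across runs;
    the hash function is an assumption, not the one from the talk.
    """
    vec = [0.0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


review = "This scarf I bought is very strange."
words = tokenize(review)
features = hashing_tf(words)
```

Because the vector length is fixed up front, no vocabulary has to be built or shared, which is what makes this step easy to run on separate chunks of data. Distinct words can collide in the same slot, so the counts are approximate term frequencies.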

Our ML workflow
Feature Extraction → Cross Validation [Linear Regression]
Regularization parameter: {0.0, 0.1, ...}
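Cross validation over a grid of regularization values can be sketched as follows. This is a single-machine toy under stated assumptions: the data is synthetic, the model is a one-feature ridge regression without intercept, and the grid values beyond 0.0 and 0.1 are made up for illustration.

```python
import random


def fit_ridge_1d(xs, ys, reg):
    """Closed-form ridge fit for y ~ w * x (no intercept): w = sum(xy) / (sum(x^2) + reg)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg)


def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_score(xs, ys, reg, k=3):
    """Average held-out MSE over k folds for one regularization value."""
    n = len(xs)
    scores = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], reg)
        scores.append(mse(w, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / k


# Synthetic data (assumption): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 5) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

param_grid = [0.0, 0.1, 1.0, 10.0, 100.0]  # illustrative grid
scores = {reg: cv_score(xs, ys, reg) for reg in param_grid}
best_reg = min(scores, key=scores.get)
best_model = fit_ridge_1d(xs, ys, best_reg)  # refit on all data with the winner
```

Each grid value is scored independently of the others, which is why model selection is a natural first candidate for parallelization.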

Cross validation
Feature Extraction → Cross Validation [Linear Regression #1, Linear Regression #2, Linear Regression #3, ...] → Best Linear Regression


Distribute cross validation
Feature Extraction → Cross Validation [Linear Regression #1, Linear Regression #2, Linear Regression #3, ...] → Best Linear Regression
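Since every (regularization value, fold) pair is an independent fit, the candidates can be trained in parallel. The sketch below uses a local thread pool as a stand-in for cluster workers; it is not the Spark integration from the talk, and the data, grid, and helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random

K = 3  # number of folds (illustrative)

# Synthetic data (assumption): y = 2x plus Gaussian noise.
rng = random.Random(1)
DATA = []
for _ in range(60):
    x = rng.uniform(0, 5)
    DATA.append((x, 2.0 * x + rng.gauss(0, 0.5)))


def fit_and_score(task):
    """Train on K-1 folds and return (reg, held-out MSE) for one (reg, fold) task."""
    reg, fold = task
    train = [p for i, p in enumerate(DATA) if i % K != fold]
    test = [p for i, p in enumerate(DATA) if i % K == fold]
    # One-feature ridge regression without intercept, in closed form.
    w = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + reg)
    err = sum((w * x - y) ** 2 for x, y in test) / len(test)
    return reg, err


param_grid = [0.0, 0.1, 1.0, 10.0]
tasks = list(product(param_grid, range(K)))

# The pool plays the role of the cluster: each task could run on a different worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_and_score, tasks))

mean_mse = {reg: sum(e for r, e in results if r == reg) / K for reg in param_grid}
best_reg = min(mean_mse, key=mean_mse.get)
```

Note that every task still trains on a full local copy of the data, so this pattern helps while the dataset fits on one machine and model selection is the bottleneck.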

Distribute feature extraction
[Feature Extraction #1, Feature Extraction #2, Feature Extraction #3, ...] → Cross Validation [Linear Regression #1, Linear Regression #2, Linear Regression #3, ...] → Best Linear Regression
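Feature extraction works row by row, so it can be split across partitions of the data. A minimal sketch, assuming two partitions, a hashing term-frequency featurizer, and made-up review strings; the thread pool again stands in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor
import re
import zlib

NUM_FEATURES = 8  # illustrative vector length


def featurize(text):
    """Tokenize a review and hash its words into a term-frequency vector."""
    vec = [0.0] * NUM_FEATURES
    for tok in re.findall(r"[a-z0-9']+", text.lower()):
        vec[zlib.crc32(tok.encode("utf-8")) % NUM_FEATURES] += 1.0
    return vec


def featurize_partition(partition):
    """Process one partition independently of all the others."""
    return [featurize(text) for text in partition]


# Hypothetical reviews, invented for this example.
reviews = [
    "This scarf I bought is very strange.",
    "Great fit and fast shipping.",
    "Not what I expected.",
    "Would buy again.",
]

# Split rows round-robin into two partitions.
partitions = [reviews[i::2] for i in range(2)]

with ThreadPoolExecutor(max_workers=2) as pool:
    feature_parts = list(pool.map(featurize_partition, partitions))
```

No partition needs to see another partition's rows, because the hashing step requires no shared vocabulary.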

Distribute learning
[Feature Extraction #1, Feature Extraction #2, Feature Extraction #3] → Cross Validation [Linear Regression #1, Linear Regression #2, ...] → Best Linear Regression
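One way to see how a single linear regression fit can itself be distributed: each partition computes small partial sums, and only those sums are combined to solve for the model. This is a toy least-squares sketch for one feature with an intercept, using noise-free made-up data; it illustrates the idea and is not a description of MLlib's actual solver.

```python
from functools import reduce


def partial_stats(partition):
    """Per-partition sufficient statistics: (n, sum x, sum y, sum x^2, sum xy)."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Merge the statistics of two partitions by elementwise addition."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Ordinary least squares for y ~ w * x + b from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w, b


# Made-up data lying exactly on y = 3x + 1, split into three partitions.
data = [(float(x), 3.0 * x + 1.0) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]

w, b = solve(reduce(combine, map(partial_stats, partitions)))
```

Only five numbers per partition travel to the combining step, regardless of how many rows each partition holds.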

Improvements we observed
1) Faster model selection for small data
2) Faster training for large data
3) Better predictions (R^2) with more data
Also, in practice:
• More folds of Cross Validation
• Tune more parameters
• Increase model size as dataset size increases
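For reference, R^2 (the coefficient of determination) is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. A small hand-rolled sketch with made-up ratings and predictions:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ratings and predictions, for illustration only.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 4.5]

# Here SS_res = 0.75 and SS_tot = 10.0, so R^2 is about 0.925.
score = r_squared(ratings, predicted)
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean rating.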

Integrations
• Distributed data sources
• Conversions between pandas & Spark
• Conversions between scipy & MLlib types
• Distributed model selection
• Distributed feature extraction
• Distributed learning
• Conversions between scikit-learn & MLlib models

Integrations with R
DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R
head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
R-like MLlib API for generalized linear models
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

Repeating this at home
This demo used:
• Spark 1.5
• The pdspark Spark Package (to be released soon!)
The code will be posted online.
Also see the sparkit-learn package.
Try it on Databricks with a free trial @ databricks.com

What’s next?
Further work on integrations
• Python: Support more models & data types
• R: Expand GLM formula (feature interactions) & other models
Match features & behavior
Get involved!
• Contribute to Spark & Spark packages
• Provide feedback

Thank you!
spark.apache.org
spark-packages.org
databricks.com

James has worked at Microsoft for the past year. Before that, he was an independent consultant as well as having worked as a permanent employee and contractor and numerous companies. What is different about Microsoft? What is it like to see how things work “behind the curtain”? How does it compare to what he anticipated it to be like? Come join this session to find out more working for Microsoft: benefits, compensation, training, career advancement, work-life balance, travel, types of jobs, etc. We will leave plenty of time to ask questions!

Criação de Data Warehouse em Banco de Dados NoSQL com Cassandra, Spark e Python

Leandro Mendes Ferreira

PostgreSQL em projetos de Business Analytics e Big Data Analytics com Pentaho

Ambiente Livre

Trends_of_MLOps_tech_in_business

SANG WON PARK

데이터를 둘러싼 정책과, 기업과 기술의 진화는 빠르게 변화하고 있으며, 모든 지향점은 기업들이 다양한 데이터를 활용하여 경쟁력을 확보하고 이를 통해 AI기반의 혁신을 하고자 하는데 있다. 이 과정에서 수 많은 기업의 업무 전무가, 데이터 사이언티스트 등이 다양한 기업의 혁신을 지원할 수 있는 AI 모델을 검증하는 과정을 거치게 됩니다. 하지만, 이렇게 수 많은 AI 모델이 실제 비즈니스에 적용되기 위해서는 인프라, 및 서비스 관점의 기술이 반드시 필요하게 됩니다. MLOps는 기업에 필요한 혁신적인 아이디어(AI Model)을 적시에 비즈니스 환경에 적용할 수 있도록 지원하는 기술 및 트렌드 입니다. 주요 내용은 - 데이터를 둘러싼 환경의 변화 - 기업의 AI Model 적용시 마주하는 현실 - MLOps가 해결 가능한 문제들 - MLOps의 영역별 주요 기술들 - MLOps 도입 시 기업의 AI 환경은 어떻게 변할까? - AI 모델을 비즈니스 환경에 적용(배포)한다는 것은? 2021년 12월 코리아 데이터 비즈니스 트렌드(데이터산업진흥원 주최)에서 발표한 내용을 공유 가능한 부분만 정리함. 발표 영상 참고 : https://www.youtube.com/watch?v=lL-QtEzJ3WY

Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn

Neo4j

At Neo4j we believe that “Graphs Are Everywhere”. In this session, we’ll be exploring graphs within the Retail industry. We’ll discuss a range of data that are commonly available within a retail organisation, both online and “brick and mortar". We’ll illustrate some graphs which can be created by linking together different elements of that data and discuss the retail use cases those graphs can enable and transform. We’ll specifically focus on use cases like Personalised Recommendations (with a live demo), Supply Chain Management, Logistics, and Customer 360. We'll also look at some relevant graph algorithms and talk about opportunities for integration with Artificial Intelligence/Machine Learning technologies, which can be used along with Neo4j to generate new value using retail data. Walmart, Wobi, and others already deploy Neo4j for use cases like price comparison or real-time contextual and learning recommendation engines. Read about their use cases!

Splunk mint 소개

JunMyoung(준명) Youn(연)

Realtime analytics with Flink and Druid

Erhwen Kuo

DataCon.TW 2017 Real-time analytics with Flink and Druid 講者：郭二文 / 資深經理 @ 緯創資通企業應用系統講題：Real-time analytics with Flink and Druid 摘要： “即時地以多維 (OLAP) 的分式來對數據進行探索與分析是數據驅動業務的關鍵功能。這個議程將說明如何應用 Kafka、Flink 和 Druid 來構建實時在線分析系統並介紹這個快速可靠流式數據處理的架構。近年來，Apache Kafka 已成為高可用性和高度可擴展的消息傳遞的事實標準。而 Apache Flink 允許我們以最少的延遲消費，處理和生成數據。 Druid 是一個專為實時多維分析 (OLAP) 而設計的數據存儲，它低延遲數據攝取架構允許事件在它們創建後快速地就可以進行過濾、聚合和查詢的資料操作或OLAP分析。”

The Enterprise Knowledge Graph is a disruptive platform that combines emerging Big Data and Graph technologies to reinvent knowledge management inside organizations. This platform aims to organize and distribute the organization’s knowledge, and making it centralized and universally accessible to every employee. The Enterprise Knowledge Graph is a central place to structure, simplify and connect the knowledge of an organization. By removing complexity, the knowledge graph brings more transparency, openness and simplicity into organizations. That leads to democratized communications and empowers individuals to share knowledge and to make decisions based on comprehensive knowledge. This platform can change the way we work, challenge the traditional hierarchical approach to get work done and help to unleash human potential!

A chart of the big data ecosystem

Matt Turck

밑바닥부터 시작하는딥러닝 8장

Sunggon Song

법령 온톨로지의 구축 및 검색

Myungjin Lee

Best Practices for Scaling Data Science Across the Organization

Chasity Gibson

Effective data science in the enterprise is about aligning the right model, data, and infrastructure with the right outcomes. Most organizations today struggle to unlock the potential of data science to enhance decision-making and drive business value. Join Forrester and Anaconda for a webinar to learn best practices for scaling data science across your entire organization. Guest speaker Kjell Carlsson, a Forrester Senior Analyst, and Peter Wang, Anaconda CTO, will share their unique perspectives on how to tackle five key challenges facing organizations today: - Identifying, defining, and prioritizing valuable problems - Building the right teams - Leveraging the proper tools and platforms - Iterating and deploying effectively - Reaching end-users to generate value

I BELIEVE I CAN FLY

Sebastien Juras

Knowledge Graphs for Supply Chain Operations.pdf

Vaticle

Agility in supply chain operations has never been so important, especially with today's nonlinear and complex world. That is why companies with supply chains need knowledge graphs. So how do enterprises unleash the power of their own supply chain data to make smarter decisions? This is where bops comes into play. Bops activates supply chain data from existing operating systems (ERPs, Pos, OMS, etc) simplifying how operators optimize working capital in every decision. In this session, bops will showcase a few use cases that portray the power of a knowledge graph to represent a supply chain network composed of an end to end product flow driven by actions among plants, customers and suppliers. Supply chain operations visibility: - Story of a Product and an SKU: from raw material to finished goods track trace & bill of material deviations - Story of a Supplier – risk assessments – “the most influential supplier” - Story of a Process – anomaly detection – “what went wrong?” Join us for a lively discussion to learn how using knowledge graphs is already helping supply chain companies to better collect, unify, and activate their data. Speaker: Jorge Risquez Jorge is the Co-founder and CEO of bops, a headless supply chain intelligence platform helping manufacturers and distributors source, make, and deliver their products, and unlock working capital. Previously, Jorge spent a decade as a Supply Chain Consultant for Deloitte, where he worked with Fortune 500 companies such as Tyson and Cargill. In his spare time, he enjoys going for a run in Central Park and spending time with family and friends.

제 19회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [COLLABO-AZ] : 고객 세그멘테이션 기반 개인 맞춤형 추천시스템 for 루빗

BOAZ Bigdata

데이터 분석 프로젝트를 진행한 COLLABO-AZ 팀에서는 아래와 같은 프로젝트를 진행했습니다. 고객 세그멘테이션 기반 개인 맞춤형 추천시스템 for 루빗 20기 정지혜 이화여자대학교 통계학과 20기 김지민 중앙대학교 응용통계학과 20기 오태연 단국대학교 정보통계학과 20기 최은선 한양대학교 에리카캠퍼스 정보사회미디어학과

사업소개서 Ab180(공개버전)

Sungpil Nam

스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...

AWSKRUG - AWS한국사용자모임

Choisir sa solution décisionnelle - Partie 2 - Des modèles à l’analyse

Philippe Geiger

Self Service Analytics at Twitch

Imply

As Twitch grew, both the amount of data we received and the number of employees interested in the data grew rapidly. In order to continue empowering decision making as we scaled, we turned to using Druid and Imply to provide self service analytics to both our technical and non technical staff allowing them to drill into high level metrics in lieu of reading generated reports. In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including: -The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau -How the data is brought together to deliver a unified view of live customer engagement and historical trends -Operational best practices we learnt scaling Druid -An example walk through using the platform

Frame - Feature Management for Productive Machine Learning

David Stein

Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018. Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.

淺談AWS上的Log解決方案

Chin-Han Hsu

린분석9-10

HyeonSeok Choi

WALD: A Modern & Sustainable Analytics Stack

Florian Wilhelm

The name WALD-stack stems from the four technologies it is composed of, i.e. a cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source data integration engine Airbyte, the open-source full-stack BI platform Lightdash, and the open-source data transformation tool DBT. Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find under https://waldstack.org.

[데이터야놀자 2023] 비즈니스 분석가 vs 프로덕트 분석가_ 데이터 분석ᄀ...

Jeongmin Ju

Deep dive into LangChain integration with Neo4j.pptx

TomazBratanic1

Automation Technology Powerpoint Presentation Slides

SlideTeam

"You can download this product from SlideTeam.net" SlideTeam introduces Automation Technology Powerpoint Presentation Slides. Gain access to 49 visually-gripping PPT templates by downloading this comprehensive complete deck. Each PowerPoint slide is 100% customizable. Personalize text, colors, font, patterns, background, orientation, and shapes to achieve any imaginable result. Convert the format into PDF, PNG, or JPG as and when necessary. Access this presentation using standard or widescreen resolutions. It is compatible with Google Slides. https://bit.ly/3HO8wvJ

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

Databricks

Spark Summit EU 2015: Lessons from 300+ production users

Databricks

At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.

What's hot

개발자, 머신러닝 엔지니어로 살아남기

Curt Park

Enterprise Knowledge Graph

Lukas Masuch

A chart of the big data ecosystem

Matt Turck

밑바닥부터 시작하는딥러닝 8장

Sunggon Song

법령 온톨로지의 구축 및 검색

Myungjin Lee

Best Practices for Scaling Data Science Across the Organization

Chasity Gibson

I BELIEVE I CAN FLY

Sebastien Juras

Knowledge Graphs for Supply Chain Operations.pdf

Vaticle

제 19회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [COLLABO-AZ] : 고객 세그멘테이션 기반 개인 맞춤형 추천시스템 for 루빗

BOAZ Bigdata

사업소개서 Ab180(공개버전)

Sungpil Nam

스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...

AWSKRUG - AWS한국사용자모임

Choisir sa solution décisionnelle - Partie 2 - Des modèles à l’analyse

Philippe Geiger

Self Service Analytics at Twitch

Imply

Frame - Feature Management for Productive Machine Learning

David Stein

淺談AWS上的Log解決方案

Chin-Han Hsu

린분석9-10

HyeonSeok Choi

WALD: A Modern & Sustainable Analytics Stack

Florian Wilhelm

[데이터야놀자 2023] 비즈니스 분석가 vs 프로덕트 분석가_ 데이터 분석ᄀ...

Jeongmin Ju

Deep dive into LangChain integration with Neo4j.pptx

TomazBratanic1

Automation Technology Powerpoint Presentation Slides

SlideTeam

What's hot (20)

개발자, 머신러닝 엔지니어로 살아남기

Enterprise Knowledge Graph

A chart of the big data ecosystem

밑바닥부터 시작하는딥러닝 8장

법령 온톨로지의 구축 및 검색

Best Practices for Scaling Data Science Across the Organization

I BELIEVE I CAN FLY

Knowledge Graphs for Supply Chain Operations.pdf

제 19회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [COLLABO-AZ] : 고객 세그멘테이션 기반 개인 맞춤형 추천시스템 for 루빗

사업소개서 Ab180(공개버전)

스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...

Choisir sa solution décisionnelle - Partie 2 - Des modèles à l’analyse

Self Service Analytics at Twitch

Frame - Feature Management for Productive Machine Learning

淺談AWS上的Log解決方案

린분석9-10

WALD: A Modern & Sustainable Analytics Stack

[데이터야놀자 2023] 비즈니스 분석가 vs 프로덕트 분석가_ 데이터 분석ᄀ...

Deep dive into LangChain integration with Neo4j.pptx

Automation Technology Powerpoint Presentation Slides

Viewers also liked

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

Databricks

Spark Summit EU 2015: Lessons from 300+ production users

Databricks

Enabling exploratory data science with Spark and R

Databricks

R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. In this talk we will present an overview of the SparkR architecture, including how data and control is transferred between R and JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.

Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...

Databricks

The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures. In this three-part session, Ayad Shammout and Denny will show: 1) How we tried to solve this problem using traditional DW techniques 2) How we took advantage of the DW capabilities in Apache Spark AND easily transition to Spark MLlib so we could more easily predict available OR block times resulting in better OR utilization and shorter wait times for patients. 3) Some of the key learnings we had when migrating from DW to Spark.

Spark Summit EU 2015: Matei Zaharia keynote

Databricks

2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I’ll look back at how the Spark community is has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop); the types of applications run on Spark; and the types of users, now that features like R support and DataFrames are available in Spark. I’ll also cover the ongoing work in the upcoming releases of Spark to support new use cases.

Spark Summit EU 2015: Reynold Xin Keynote

Databricks

Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...

Databricks

A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.

Viewers also liked (7)

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

Spark Summit EU 2015: Lessons from 300+ production users

Enabling exploratory data science with Spark and R

Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...

Spark Summit EU 2015: Matei Zaharia keynote

Spark Summit EU 2015: Reynold Xin Keynote

Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...

Similar to Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

Combining Machine Learning frameworks with Apache Spark

DataWorks Summit/Hadoop Summit

Combining Machine Learning Frameworks with Apache Spark

Databricks

Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Databricks

These are the slides to support the Apache® Spark™ MLlib: From Quick Start to Scikit-Learn webinar. In this webcast, Joseph Bradley from Databricks will be speaking about Apache Spark’s distributed Machine Learning Library - MLlib. We will start off with a quick primer on machine learning, Spark MLlib, and a quick overview of some Spark machine learning use cases. We will continue with multiple Spark MLlib quick start demos. Afterwards, the talk will transition toward the integration of common data science tools like Python pandas, scikit-learn, and R with MLlib

Apache Spark sql

aftab alam

From Pipelines to Refineries: scaling big data applications with Tim Hunter

Databricks

Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.

DataMass Summit - Machine Learning for Big Data in SQL Server

Łukasz Grala

Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...

Perficient, Inc.

Most organizations still rely on batch and offline processing of data streams to gain meaningful analysis and insight into their business. However, in our instant gratification world, real-time computation and analysis of streaming data is crucial in gaining insight into patterns and threats. A trend is emerging for real-time and instant analysis from live data streams, promoting the value of logs and a move toward functional programming. This shift in technology is not about what and how to store the data, but what we can do with it to see emerging patterns and trends across multiple resources, applications, services and environments. Log data represents a wealth of information, yet is often sporadic, unstructured, scattered across the enterprise and difficult to track. These slides provide insights into some of the most helpful Big Data tools used by the largest social media and data-centric organizations for competitive trends, instant analysis and feedback from large volume data streams. We show how how using Big Data tools Storm, ElasticSearch and an elastic UI can turn application logs into real-time analytical views. You will also learn how Big Data: Contains data that is elastic, minimally structured, flexible and scalable Helps process live streams into meaningful data Promotes a move toward functional programming Effects the enterprise data architecture Works with real-time CEP tools like Storm for functional programming

Fighting Fraud with Apache Spark

Miklos Christine

Using BigBench to compare Hive and Spark (Long version)

Nicolas Poggi

BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. From the available implementation, we can test the different framework combinations such as Hadoop+Hive (with Mahout) and Spark (SparkSQL+MLlib) in their different versions and configurations, helping us to spot problems and possible optimizations of our data stacks. This talk first introduces BigBench and how problems can it solve. Then, presents both Hive and Spark benchmark results with their respective 1 and 2 versions under distinct configurations including Tez, Mahout, MLlib. Experiments are run on Cloud and On-Prem clusters of different numbers of nodes and data scales, taking into account interactive and batch usage. Results are further classified by use cases, showing where each platform shines (or doesn't), and why, based on performance metrics and logfile analysis. The talk concludes with the main findings, the scalability, and limits of each framework. Originally presented at: https://dataworkssummit.com/munich-2017/sessions/using-bigbench-to-compare-hive-and-spark-versions-and-features/

Designing Distributed Machine Learning on Apache Spark

Databricks

This talk will cover challenges in distributing Machine Learning (ML) algorithms. I will begin with background: constraints introduced by distributed computing, major frameworks for distributed computing (including Apache Spark’s framework), and approaches for distributing ML. I will then give 2 examples of distributing common algorithms. The first, K-Means clustering, can be distributed easily. The second, decision trees, is more difficult. I will discuss distributing data by row vs. column, mentioning the resulting tradeoffs in communication, computation, and accuracy. I will also give a quick demo of learning trees in these two ways using Apache Spark to demonstrate the difference in practice. This discussion will be targeted at ML or Spark users who have some knowledge in at least one area, but not necessarily deep expertise. Listeners should come away with a better understanding of Spark’s approach to distributed ML. This knowledge should be helpful for users who want to understand strengths and limitations of distributed ML implementations, as well as developers who wish to implement their own algorithms.

From Pipelines to Refineries: Scaling Big Data Applications

Databricks

The Challenges of Bringing Machine Learning to the Masses

Alice Zheng

Apache Spark's MLlib's Past Trajectory and new Directions

Databricks

This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib. Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements. This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.

Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes


Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

1. Combining the Strengths of MLlib, scikit-learn, & R. Joseph K. Bradley, Spark Summit Europe, October 2015

2. About me: Apache Spark committer; Software Engineer @ Databricks; Ph.D. in Machine Learning @ Carnegie Mellon University

3. scikit-learn & R

4. 4

5. scikit-learn & R. Great libraries: • Detailed documentation & how-to guides • Many packages & extensions. Business investment: • Education • Tooling & workflows

6. Big Data. Scaling (trees). Topic model on 4.5 million Wikipedia articles. Recommendation with 50 million users, 5 million songs, 50 billion ratings.

7. Big Data & MLlib • More data → higher accuracy • Scale with business (# users, available data) • Integrate with production systems

8. Bridging the gap: How do you get from a single-machine workload to a fully distributed one? At school: Machine Learning with R on my laptop. The Goal: Machine Learning on a huge computing cluster.

9. Wish list • Run original code on a production environment • Use distributed data sources • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings • Use familiar APIs

10. Our task: Sentiment analysis. Given a review (text), predict the user's rating. Data from https://snap.stanford.edu/data/web-Amazon.html

11. Our ML workflow: Text (This scarf I bought is very strange. When I ...) with Label (Rating = 3.0) → Tokenizer → Words [This, scarf, I, bought, ...] → Hashing Term-Freq → Features [2.0, 0.0, 3.0, ...] → Linear Regression → Prediction (Rating = 2.0)
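The first two stages of this workflow (Tokenizer, then Hashing Term-Freq) can be sketched in plain Python. This is only an illustration of the idea, not the code used in the talk: the function names `tokenize` and `hashing_tf`, the choice of `zlib.crc32` as the hash, and the vector length of 16 are all assumptions made for the example.

```python
import re
import zlib


def tokenize(text):
    """Split a review into word tokens (hypothetical helper, keeps case)."""
    return re.findall(r"\w+", text)


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector via the hashing trick."""
    features = [0.0] * num_features
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        features[index] += 1.0
    return features


# The review text is the (truncated) example from the slide.
review = "This scarf I bought is very strange. When I ..."
words = tokenize(review)
vector = hashing_tf(words)
print(words[:4])
print(vector)
```

Because the vector length is fixed up front, no vocabulary has to be built or shared, which is what makes this kind of feature extraction easy to run independently on separate partitions of the data.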

12. Our ML workflow: Cross Validation over Feature Extraction → Linear Regression, with regularization parameter: {0.0, 0.1, ...}

13. Cross validation: Feature Extraction → Linear Regression #1, Linear Regression #2, Linear Regression #3, ... → Cross Validation → Best Linear Regression

14. Cross validation: Feature Extraction → Linear Regression #1, Linear Regression #2, Linear Regression #3, ... → Cross Validation → Best Linear Regression

15. Distribute cross validation: Feature Extraction → Linear Regression #1, Linear Regression #2, Linear Regression #3, ... → Cross Validation → Best Linear Regression
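The reason cross validation is the easiest piece to distribute is that each candidate regularization value can be fitted and scored independently of the others. The sketch below shows that structure in plain Python, under several assumptions: a one-feature ridge regression with a closed-form fit stands in for Linear Regression, a local thread pool stands in for a Spark cluster, and the grid values beyond 0.0 and 0.1 are made up for the example. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_ridge_1d(xs, ys, reg):
    """Closed-form ridge fit for y ~ w * x (single feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg)


def cv_error(reg, xs, ys, num_folds=3):
    """Mean squared validation error of one regularization value over k folds."""
    total, count = 0.0, 0
    for fold in range(num_folds):
        train = [i for i in range(len(xs)) if i % num_folds != fold]
        valid = [i for i in range(len(xs)) if i % num_folds == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], reg)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in valid)
        count += len(valid)
    return total / count


def grid_search(reg_grid, xs, ys, max_workers=4):
    """Score every candidate independently and return (best_reg, errors)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(lambda reg: cv_error(reg, xs, ys), reg_grid))
    best_index = min(range(len(reg_grid)), key=errors.__getitem__)
    return reg_grid[best_index], errors


# Toy data: y = 2x exactly, so the unregularized fit should win.
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x for x in xs]
best_reg, errors = grid_search([0.0, 0.1, 1.0, 10.0], xs, ys)
print(best_reg, errors)
```

A thread pool gives no real speedup for pure-Python arithmetic; it is used here only to show that the per-candidate work is an independent map, which is the shape a cluster can parallelize.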

16. Distribute feature extraction: Feature Extraction #1, Feature Extraction #2, Feature Extraction #3, ... → Linear Regression #1, Linear Regression #2, Linear Regression #3 → Cross Validation → Best Linear Regression

17. Distribute learning: Feature Extraction #1, Feature Extraction #2, Feature Extraction #3 → Linear Regression #1, Linear Regression #2, ... → Cross Validation → Best Linear Regression

18. Improvements we observed: 1) Faster model selection for small data 2) Faster training for large data 3) Better predictions (R^2) with more data. Also, in practice: • More folds of Cross Validation • Tune more parameters • Increase model size as dataset size increases
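For reference, the R^2 metric mentioned above is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal Python sketch follows; the function name `r_squared` and the sample ratings are invented for illustration and are not results from the talk.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ratings and predictions, purely for illustration.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = [2.0, 2.0, 3.0, 4.0, 4.0]
print(r_squared(ratings, predictions))
```

A model that always predicts the mean rating scores 0, and a perfect model scores 1, so higher values mean better predictions.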

19. Integrations • Distributed data sources • Conversions between pandas & Spark • Conversions between scipy & MLlib types • Distributed model selection • Distributed feature extraction • Distributed learning • Conversions between scikit-learn & MLlib models

20. Integrations with R
DataFrames • Conversions between R (local) & Spark (distributed) • SQL queries from R
    head(filter(df, df$waiting < 50))
    ##   eruptions waiting
    ## 1     1.750      47
    ## 2     1.750      47
    ## 3     1.867      48
R-like MLlib API for generalized linear models
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

21. Repeating this at home. This demo used: • Spark 1.5 • The pdspark Spark Package (to be released soon!). The code will be posted online. Also see the sparkit-learn package. Try it on Databricks with a free trial @ databricks.com

22. What’s next? Further work on integrations: • Python: Support more models & data types • R: Expand GLM formula (feature interactions) & other models. Match features & behavior. Get involved! • Contribute to Spark & Spark packages • Provide feedback

23. Thank you! spark.apache.org spark-packages.org databricks.com
