Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.

Spark’s Role in the Big Data Ecosystem
Matei Zaharia

An Exciting Year for Spark
Very fast community growth
1.0 release in May
7+ distributors, 20+ apps

Project Activity
                          June 2013    June 2014
total contributors        68           255
companies contributing    17           50
total lines of code       63,000       175,000
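
To put the year-over-year figures in proportion, here is a minimal sketch that computes the growth factor for each row. The numbers are copied from the slide; the names `activity` and `growth_factor` are made up for this illustration.

```python
# Figures copied from the Project Activity slide (June 2013 -> June 2014).
activity = {
    "total contributors": (68, 255),
    "companies contributing": (17, 50),
    "total lines of code": (63_000, 175_000),
}


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


growth = {name: growth_factor(*pair) for name, pair in activity.items()}

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
```

Under these figures, contributors grew fastest (3.75x), ahead of companies (about 2.94x) and lines of code (about 2.78x).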

Compared to Other Projects
[Figure: two bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark on commits and on lines of code changed. Activity in past 6 months.]

Compared to Other Projects
Spark is one of the top 3 most active projects at Apache
More active than “general” data processing projects like NumPy, matplotlib, scikit-learn

Continuing Growth
source: ohloh.net
Contributors per month to Spark

Last Summit
Last Summit we said we’d focus on two things:
• Standard libraries
• Enterprise features
New libraries: Spark SQL, MLlib (machine learning),
GraphX (graph processing)
Enterprise features: security, monitoring, HA

Spark SQL
Enables loading & querying structured data in Spark
From Hive:
from pyspark.sql import HiveContext  # sc is an existing SparkContext
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
tweets.json:
{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}
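
The JSON query on this slide reads a nested field with dotted syntax (`user.name`). The sketch below is a plain-Python analogue of that projection on the sample record, not Spark itself; the helper `select_path` is a hypothetical name introduced only for this illustration.

```python
import json

# The sample record shown on the slide as the contents of tweets.json.
record = json.loads('{"text": "hi", "user": {"name": "matei", "id": 123}}')


def select_path(row, dotted_path):
    """Follow a dotted path such as 'user.name' through nested dicts."""
    value = row
    for key in dotted_path.split("."):
        value = value[key]
    return value


# Plain-Python analogue of: select text, user.name from tweets
projected = (select_path(record, "text"), select_path(record, "user.name"))
print(projected)  # ('hi', 'matei')
```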

Spark SQL
Integrates closely with Spark’s language APIs
c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access
Data sources: Hive, Parquet, JSON, Cassandra, …
Languages: SQL, Python, Scala, Java
44 contributors in past year
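
As a rough illustration of what the `hasSpark` function does once it is called from SQL, the following plain-Python sketch applies the same predicate to a few rows. The rows are invented for this example, and the list comprehension only stands in for the `where` clause; it is not how Spark evaluates the query.

```python
def has_spark(text):
    """Same predicate the slide registers under the name hasSpark."""
    return "Spark" in text


# Hypothetical rows standing in for the tweets table.
tweets = [
    {"text": "Spark 1.0 is out"},
    {"text": "hi"},
    {"text": "trying Spark SQL"},
]

# Plain-Python analogue of: select * from tweets where hasSpark(text)
matching = [row for row in tweets if has_spark(row["text"])]
print([row["text"] for row in matching])
```

Note that the predicate is case-sensitive, so a tweet containing only "spark" in lowercase would not match.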

Machine Learning Library (MLlib)
Standard library of machine learning algorithms
Now includes 15+ algorithms
• New in 1.0: decision trees, SVD, PCA, L-BFGS
• In development: non-negative matrix factorization, LDA,
Lanczos, multiclass trees, ADMM
from pyspark.mllib.clustering import KMeans
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
40 contributors in past year
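
For readers unfamiliar with what a k-means training call computes, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not MLlib's implementation: the seeding (first k points), the iteration cap, and the sample coordinates are all assumptions chosen to keep the example small and deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; returns the list of k centers."""
    # Deterministic initialisation: the first k points (an assumption made
    # for reproducibility; real libraries use smarter seeding).
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                xs, ys = zip(*members)
                new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


# Hypothetical (latitude, longitude) pairs forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = kmeans(points, k=2)
print(sorted(centers))
```

The second argument in the slide's `KMeans.train(points, 10)` plays the role of `k` here: the number of clusters to find.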

Java 8 API
Enables concise programming in Java similar to
Scala and Python
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
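
For comparison with the Python style the slide mentions, the same map-then-reduce shape can be sketched with Python's built-ins. This runs on an in-memory list rather than a Spark RDD, and the sample lines are a hypothetical stand-in for the contents of data.txt.

```python
from functools import reduce

# Hypothetical stand-in for the lines of data.txt.
lines = ["spark", "is", "fast"]

# Mirrors lines.map(s -> s.length())
line_lengths = map(len, lines)

# Mirrors lineLengths.reduce((a, b) -> a + b)
total_length = reduce(lambda a, b: a + b, line_lengths)

print(total_length)  # 11
```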

1. Unified Platform for Big Data Apps
Uniform API for diverse workloads over diverse storage systems and runtimes
Workloads: Batch, Interactive, Streaming, …
Storage systems and runtimes: Hadoop, Cassandra, Mesos, cloud providers, …

Why a Platform Matters
Good for developers: one system to learn
Good for users: take apps anywhere
Good for distributors: more applications

2. Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this
Languages: Python, Scala, Java, R
Libraries on top of Core: SQL, ML, graph, …
Much of future activity will be in these libraries

Databricks & Spark
At Databricks, we are working to keep Spark 100%
open source and compatible across vendors
All our work on Spark is at Apache
Check out project-specific talks to see what’s next!

Databricks for Dummies

Rodney Joyce

Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real live examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;) After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters. This is part 1 of an 8 part Data Science for Dummies series: Databricks for dummies Titanic survival prediction with Databricks + Python + Spark ML Titanic with Azure Machine Learning Studio Titanic with Databricks + Azure Machine Learning Service Titanic with Databricks + MLS + AutoML Titanic with Databricks + MLFlow Titanic with DataRobot Deployment, DevOps/MLops and Operationalization

Databricks on AWS.pptx

Wasm1953

Modernizing to a Cloud Data Architecture

Databricks

Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.

Data Lakehouse Symposium | Day 4

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Achieving Lakehouse Models with Spark 3.0

Databricks

It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star Schemas & Kimball is one of those things that isn’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise it’s performance?

Announcing Databricks Cloud (Spark Summit 2014)

Databricks

Databricks Platform.pptx

Alex Ivy

Delta Lake, an open-source innovations which brings new capabilities for transactions, version control and indexing your data lakes. We uncover how Delta Lake benefits and why it matters to you. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta lake provides snapshot isolation which helps concurrent read/write operations and enables efficient insert, update, deletes, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning achieving better performance improvements. In this presentation, we will learn the Delta Lake benefits and how it solves common data lake challenges, and most importantly new Delta Time Travel capability.

Introduction to Data Engineering

Durga Gadiraju

As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up to date trends. * Introduction to Data Engineering * Role of Big Data in Data Engineering * Key Skills related to Data Engineering * Role of Big Data in Data Engineering * Overview of Data Engineering Certifications * Free Content and ITVersity Paid Resources Don't worry if you miss the video - you can click on the below link to go through the video after the schedule. https://youtu.be/dj565kgP1Ss * Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/ Relevant Playlists: * Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi * Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl * Join our Meetup group - https://www.meetup.com/itversityin/ * Enroll for our labs - https://labs.itversity.com/plans * Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1 * Access Content via our GitHub - https://github.com/dgadiraju/itversity-books * Lab and Content Support using Slack

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Databricks

Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.

Making Apache Spark Better with Delta Lake

Databricks

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved!

Delta lake and the delta architecture

Adam Doyle

Real-time Analytics with Trino and Apache Pinot

Xiang Fu

Databricks Fundamentals

Dalibor Wijas

Making Data Timelier and More Reliable with Lakehouse Technology

Matei Zaharia

Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.

Intro to Delta Lake

Databricks

Designing Structured Streaming Pipelines—How to Architect Things Right

Databricks

"Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem needs to be solved. What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data? What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output? When do you want it? When does the business want to the data? What is the acceptable latency? Do you really want to millisecond-level latency? How much are you willing to pay for it? This is the ultimate question and the answer significantly determines how feasible is it solve the above questions. These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem."

Diving into Delta Lake: Unpacking the Transaction Log

Databricks

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.

Building Lakehouses on Delta Lake with SQL Analytics Primer

Databricks

The delta architecture

Prakash Chockalingam

Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems. There have been attempts to unify batch and streaming into a single system in the past. Organizations have not been that successful though in those attempts. But, with the advent of Delta Lake, we are seeing lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture, The Delta Architecture.

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020

Timothy McAliley

Introduction to Azure Data Lake

Antonios Chatzipavlis

Introduction to Azure Databricks

James Serra

Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (Mllib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Data Discovery at Databricks with Amundsen

Databricks

Databricks used to use a static manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity with trust by surfacing the most relevant dataset and SQL analytics dashboard with its important information programmatically at Databricks internally. We will also talk about how we integrate Amundsen with Databricks world class infrastructure to surface metadata including: Surface the most popular tables used within Databricks Support fuzzy search and facet search for dataset- Surface rich metadata on datasets: Lineage information (downstream table, upstream table, downstream jobs, downstream users) Dataset owner Dataset frequent users Delta extend metadata (e.g change history) ETL job that generates the dataset Column stats on numeric type columns Dashboards that use the given dataset Use Databricks data tab to show the sample data Surface metadata on dashboards including: create time, last update time, tables used, etc Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.

Big Data Ecosystem - 1000 Simulated Drones

Espeo Software

AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...

Amazon Web Services

Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.

What's hot

Summary introduction to data engineering

Novita Sari

DW Migration Webinar-March 2022.pptx

Databricks

Databricks Delta Lake and Its Benefits

Databricks

Introduction to Data Engineering

Durga Gadiraju

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Databricks

Making Apache Spark Better with Delta Lake

Databricks

Delta lake and the delta architecture

Adam Doyle

Real-time Analytics with Trino and Apache Pinot

Xiang Fu

Databricks Fundamentals

Dalibor Wijas

Making Data Timelier and More Reliable with Lakehouse Technology

Matei Zaharia

Intro to Delta Lake

Databricks

Designing Structured Streaming Pipelines—How to Architect Things Right

Databricks

Diving into Delta Lake: Unpacking the Transaction Log

Databricks

Building Lakehouses on Delta Lake with SQL Analytics Primer

Databricks

The delta architecture

Prakash Chockalingam

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020

Timothy McAliley

Introduction to Azure Data Lake

Antonios Chatzipavlis

Introduction to Azure Databricks

James Serra

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

Data Discovery at Databricks with Amundsen

Databricks

What's hot (20)

Summary introduction to data engineering

DW Migration Webinar-March 2022.pptx

Databricks Delta Lake and Its Benefits

Introduction to Data Engineering

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Making Apache Spark Better with Delta Lake

Delta lake and the delta architecture

Real-time Analytics with Trino and Apache Pinot

Databricks Fundamentals

Making Data Timelier and More Reliable with Lakehouse Technology

Intro to Delta Lake

Designing Structured Streaming Pipelines—How to Architect Things Right

Diving into Delta Lake: Unpacking the Transaction Log

Building Lakehouses on Delta Lake with SQL Analytics Primer

The delta architecture

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020

Introduction to Azure Data Lake

Introduction to Azure Databricks

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Data Discovery at Databricks with Amundsen

Viewers also liked

Big Data Ecosystem - 1000 Simulated Drones

Espeo Software

AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...

Amazon Web Services

Temporal Databases: Data Models

torp42

JupyterHub for Interactive Data Science Collaboration

Carol Willing

Jupyter, A Platform for Data Science at Scale

Matthias Bussonnier

Lawrence berkeley national laboratory sep 2015 - Jupyter Talk Scientific facilities are increasingly generating large data sets. Next-generation scientific productivity relies on user-friendly tools and efficient, effective and seamless access to resources and data. Traditional approaches to research and software development for science focus on the hardware and software of the machine and do not consider the user. In this talk, I will highlight a different approach to building software for scientific users by including user knowledge in the process. I will illustrate a few example projects where this has been used to date. GIthub repository: https://github.com/Carreau/talks/tree/master/labtech-2015

Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...

Mitul Tiwari

Big data ecosystemmagda3695

Temporal

sunsie

Bde euro proworkshop

BigData_Europe

Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...

Denodo

Autodesk designed a modern data architecture that heavily uses data virtualization to integrate both legacy data sources and contemporary big data analytics like Spark into a single unified logical data warehouse. In this presentation, you will learn how to build a logical data warehouse using data virtualization and create a single, unified enterprise-wide access and governance point for any data used within the company. This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/Ab4PDB.

Temporal databaseHussain Azmee

The Big Data Ecosystem for Financial Services

DataStax

Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016

Caserta

The Big Data Ecosystem at LinkedInOSCON Byrum

BDE-SC6 Hangout - “Insight into Virtual Currency Ecosystems”

BigData_Europe

1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...

Jürgen Ambrosi

I dati sono il nuovo Capitale: come il capitale finanziario, sono una risorsa che deve essere gestita, raccolta e tenuta al sicuro, ma deve essere anche investita dalle organizzazioni che vogliono ottenere vantaggio competitivo. I dati non sono una risorsa nuova, ma soltanto oggi per la prima volta sono disponbili in abbondanza assieme alle tecnologie necessarie per massimizzarne il ritorno. Esattamente come l'elettricità fu una curiosità da laboratorio per molto tempo, finché non venne resa disponibile alle masse e dunque cambiò totalmente il volto dell'industria moderna.Ecco perché per accelerare il cambiamento è necessario un approccio innovativo alla esecuzione delle iniziative orientate ai Big Data: un laboratorio analitico come catalizzatore dell'innovazione (Data Lab).In questo webinar sulle tecnologie Oracle, utilizzeremo il consueto approccio del racconto basato su casi d’uso ed esperienze concrete.

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

Spark Summit

Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search. Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.

The Ecosystem is too damn big

DataWorks Summit/Hadoop Summit

Overview - IBM Big Data Platform

Vikas Manoria

Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka

Edureka!

This Edureka Hadoop Ecosystem Tutorial (Hadoop Ecosystem blog: https://goo.gl/EbuBGM) will help you understand about a set of tools and services which together form a Hadoop Ecosystem. Below are the topics covered in this Hadoop Ecosystem Tutorial: Hadoop Ecosystem: 1. HDFS - Hadoop Distributed File System 2. YARN - Yet Another Resource Negotiator 3. MapReduce - Data processing using programming 4. Spark - In-memory Data Processing 5. Pig, Hive - Data Processing Services using Query 6. HBase - NoSQL Database 7. Mahout, Spark MLlib - Machine Learning 8. Apache Drill - SQL on Hadoop 9. Zookeeper - Managing Cluster 10. Oozie - Job Scheduling 11. Flume, Sqoop - Data Ingesting Services 12. Solr & Lucene - Searching & Indexing 13. Ambari - Provision, Monitor and Maintain Cluster

Viewers also liked (20)

Big Data Ecosystem - 1000 Simulated Drones

AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...

Temporal Databases: Data Models

JupyterHub for Interactive Data Science Collaboration

Jupyter, A Platform for Data Science at Scale

Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...

Big data ecosystem

Temporal

Bde euro proworkshop

Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...

Temporal database

The Big Data Ecosystem for Financial Services

Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016

The Big Data Ecosystem at LinkedIn

BDE-SC6 Hangout - “Insight into Virtual Currency Ecosystems”

1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

The Ecosystem is too damn big

Overview - IBM Big Data Platform

Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka

Similar to Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Composable Parallel Processing in Apache Spark and Weld

Databricks

The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc). Speaker: Matei Zaharia

BDTC2015 databricks-辛湜-state of spark

Jerry Wen

From http://www.csdn.net/article/2015-12-17/2826501 《Databricks公司联合创始人、Spark首席架构师辛湜：Spark发展：回顾2015，展望2016 》辛湜介绍了Spark的目标是“Unified engine across data workloads and platforms”。在谈到Spark在2015年最大的改变时，他感觉应该是增加了DataFrames API。对于Spark的生态圈，他表示主要侧重三个不同的方向，一个是上层的应用，二是下层的环境，还有最重要的是连接到的数据源。

Spark streaming State of the Union - Strata San Jose 2015

Databricks

Why spark by Stratio - v.1.0

Stratio

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

DataStax Academy

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

The BDAS Open Source Community

jeykottalam

Dev Ops Training

Spark Summit

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Databricks

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Big Data Processing with .NET and Spark (SQLBits 2020)

Michael Rys

Present and future of unified, portable, and efficient data processing with A...

DataWorks Summit

The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere." This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem. Speaker Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant

Big data apache spark + scala

Juantomás García Molina

Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...

Databricks

2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order of magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.

Started with-apache-spark

Happiest Minds Technologies

New directions for Apache Spark in 2015

Databricks

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...

Simplilearn

This presentation on Spark Architecture introduces Apache Spark, its essential features, and its components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ

What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.

What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos

What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming

Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark

Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Spark + AI Summit 2020 Event Overview

Paulo Gutierrez

Koalas: Unifying Spark and pandas APIs

Takuya UESHIN

H2O World - H2O Rains with Databricks Cloud

Sri Ambati

Spark Community Update - Spark Summit San Francisco 2015

Databricks

Developing Applications in the Era of "Big Data" with Scala and Spark - Mario Car...

Codemotion

Scala is a general-purpose, multi-paradigm programming language designed for building high-performance applications that run on the Java Virtual Machine. Spark is the most flexible and best-performing Scala-based "Big Data" framework available on the market today. The talk will introduce the Scala language and show the potential of using it to develop latest-generation web applications, including the ability to process large amounts of data in parallel with the Spark framework.

More from Databricks

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Democratizing Data Quality Through a Centralized Platform

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, the reports and dashboards built against the data, reproducibility, and the insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale and on one unified platform.

Why APM Is Not the Same As ML Monitoring

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling builds on Project Hydrogen by improving big data ETL and AI integration, and it also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into the format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster DataFrame access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may face a problem: how can I convert my Spark DataFrame to a format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all this engineering friction greatly reduces data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads, and there is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing on Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
• Understanding key traits of Apache Spark on Kubernetes
• Things to know when running Apache Spark on Kubernetes, such as autoscaling
• A demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

Pipelines have become ubiquitous as the need to string multiple functions together to compose applications has grown. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Sawtooth Windows for Feature Aggregations

Databricks

In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

We want to present multiple anti-patterns that use Redis in unconventional ways to get the most out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.

Niche 1: Long-running Spark batch job that dispatches new jobs by polling a Redis queue
• Why? Custom queries on top of a table; we load the data once and query N times
• Why not Structured Streaming
• Working solution using Redis

Niche 2: Distributed counters
• Problems with Spark accumulators
• Utilize Redis hashes as distributed counters
• Precautions for retries and speculative execution
• Pipelining to improve performance
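The retry and speculative-execution precaution for counters can be illustrated with a small sketch. It is only one plausible approach, not necessarily the one used in the talk: each task overwrites its own hash field (the semantics of the Redis `HSET` command) instead of incrementing a shared field (`HINCRBY`), so a re-executed task cannot double-count. `InMemoryHash`, `run_partition`, and `naive_run_partition` are hypothetical names, and the dictionary-backed class merely stands in for a real Redis hash so the example runs without a server.

```python
class InMemoryHash:
    """Hypothetical in-memory stand-in for a single Redis hash."""

    def __init__(self):
        self._fields = {}

    def hset(self, field, value):
        # Same semantics as Redis HSET: overwrite the field.
        self._fields[field] = value

    def hincrby(self, field, amount):
        # Same semantics as Redis HINCRBY: add to the field.
        self._fields[field] = self._fields.get(field, 0) + amount
        return self._fields[field]

    def hvals(self):
        # Same semantics as Redis HVALS: all values in the hash.
        return list(self._fields.values())


def count_even(records):
    """Toy per-partition metric: number of even records."""
    return sum(1 for r in records if r % 2 == 0)


def run_partition(counter, partition_id, records):
    """Idempotent attempt: one field per partition, overwritten on retry."""
    counter.hset(f"partition:{partition_id}", count_even(records))


def naive_run_partition(counter, records):
    """Non-idempotent attempt: every retry increments the total again."""
    counter.hincrby("total", count_even(records))


partitions = {0: [1, 2, 3, 4], 1: [6, 7, 8]}
safe, naive = InMemoryHash(), InMemoryHash()

for pid, records in partitions.items():
    run_partition(safe, pid, records)
    naive_run_partition(naive, records)

# Simulate a retried (or speculatively duplicated) task for partition 1.
run_partition(safe, 1, partitions[1])
naive_run_partition(naive, partitions[1])

safe_total = sum(safe.hvals())    # per-partition fields summed at read time
naive_total = sum(naive.hvals())  # inflated by the duplicate attempt
print(safe_total, naive_total)
```

The trade-off in this sketch is that the idempotent variant stores one field per partition and needs a final aggregation at read time, whereas the incrementing variant keeps a single field but is only correct if every task runs exactly once.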

Re-imagine Data Monitoring with whylogs and Spark

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
• What are we storing?
• Multi Source – Multi Channel Problem
• Data Representation and Nested Schema Evolution
• Performance Trade Offs with Various formats
• Go over anti-patterns used (String FTW)
• Data Manipulation using UDFs
• Writer Worries and How to Wipe them Away
• Staging Tables FTW
• Datalake Replication Lag Tracking
• Performance Time!

Machine Learning CI/CD for Email Attack Detection

Databricks

Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models. In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.

Jeeves Grows Up: An AI Chatbot for Performance and Quality

Databricks

Sarah: CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.

Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert. We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:

Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.

This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

Databricks

Hyperparameter tuning is critical in model development, and its general form, parameter tuning with an objective function, is also widely used in industry. Apache Spark can handle massive parallelism, and Apache Spark ML is a solid machine learning solution. Yet we have not seen a general and intuitive distributed parameter tuning solution based on Apache Spark. Why?
• Not every tuning problem is on Apache Spark ML models. How can Apache Spark handle general models?
• Not every tuning problem is a parallelizable grid or random search. Bayesian optimization is sequential; how can Apache Spark help in this case?
• Not every tuning problem is single epoch; deep learning is not. How can algorithms such as Hyperband and ASHA fit into Apache Spark?
• Not every tuning problem is a machine learning problem; for example, simulation + tuning is also common. How to generalize?

In this talk, we are going to show how using Fugue-Tune and Apache Spark together can eliminate these pain points:
• Fugue-Tune, like Fugue, is a “super framework”: an abstraction layer unifying existing solutions such as Hyperopt and Optuna.
• It first models the general tuning problem, independent of machine learning.
• It is designed for both small- and large-scale problems. It can always fully parallelize the distributable part of a tuning problem.
• It works for both classical and deep learning models. With Fugue, running Hyperband and ASHA becomes possible on Apache Spark.

In the demo, you will see how to do any type of tuning in a consistent, intuitive, scalable and minimal way, and you will see a live demonstration of its performance.

Recently uploaded

Criminal IP - Threat Hunting Webinar.pdf

Criminal IP

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Subhajit Sahu

Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Subhajit Sahu

Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.

一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理

oz8q3jxlp

原版定制【微信:41543339】【(Deakin毕业证书)迪肯大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Subhajit Sahu

For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理

g4dpvqap0

毕业原版【微信:41543339】【(爱大毕业证书)爱丁堡大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样

u86oixdj

学校原件一模一样【微信：741003700 】《(swinburne毕业证书)斯威本科技大学毕业证》【微信：741003700 】学位证，留信认证（真实可查，永久存档）原件一模一样纸张工艺/offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

Best best suvichar in gujarati english meaning of this sentence as Silk road ...

AbhimanyuSinha9

My burning issue is homelessness K.C.M.O.

rwarrenll

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...

Timothy Spann

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI Discussion on Vector Databases, Unstructured Data and AI https://www.meetup.com/unstructured-data-meetup-new-york/ This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...

Timothy Spann

Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx

AnirbanRoy608946

Machine learning and optimization techniques for electrical drives.pptx

balafet

Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...

pchutichetpong

M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years. Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success. MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies. According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. 
Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”

Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...

John Andrews

SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation" Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults Description: Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project. Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单

ewymefz

UofM毕业证【微信95270640】办文凭{明尼苏达大学毕业证}Q微Q微信95270640UofM毕业证书成绩单/学历认证UofM Diploma未毕业、挂科怎么办？+QQ微信：Q微信95270640-大学Offer（申请大学）、成绩单（申请考研）、语言证书、在读证明、使馆公证、办真实留信网认证、真实大使馆认证、学历认证办国外明尼苏达大学明尼苏达大学毕业证假文凭教育部学历学位认证留信认证大使馆认证留学回国人员证明修改成绩单信封申请学校offer录取通知书在读证明offer letter。快速办理高仿国外毕业证成绩单： 1明尼苏达大学毕业证+成绩单+留学回国人员证明+教育部学历认证（全套留学回国必备证明材料给父母及亲朋好友一份完美交代）; 2雅思成绩单托福成绩单OFFER在读证明等留学相关材料（申请学校转学甚至是申请工签都可以用到）。 3.毕业证 #成绩单等全套材料从防伪到印刷从水印到钢印烫金高精仿度跟学校原版100%相同。专业服务请勿犹豫联系我！联系人微信号：95270640诚招代理：本公司诚聘当地代理人员如果你有业余时间有兴趣就请联系我们。国外明尼苏达大学明尼苏达大学毕业证假文凭办理过程： 1客户提供办理信息：姓名生日专业学位毕业时间等（如信息不确定可以咨询顾问：我们有专业老师帮你查询）； 2开始安排制作毕业证成绩单电子图； 3毕业证成绩单电子版做好以后发送给您确认； 4毕业证成绩单电子版您确认信息无误之后安排制作成品； 5成品做好拍照或者视频给您确认； 6快递给客户（国内顺丰国外DHLUPS等快读邮寄）。有一次山娃坐在门口写作业写着写着竟伏在桌上睡着了迷迷糊糊中山娃似乎听到了父亲的脚步声当他晃晃悠悠站起来时才诧然发现一位衣衫破旧的妇女挎着一只硕大的蛇皮袋手里拎着长铁钩正站在门口朝黑色的屋内张望不好坏人小偷山娃一怔却也灵机一动立马仰起头双手拢在嘴边朝楼上大喊：“爸爸爸——有人找——那人一听朝山娃尴尬地笑笑悻悻地走了山娃立马“嘭的一声将铁门锁死心却咚咚地乱跳当山娃跟父亲说起这事时父亲很吃惊抚摸着山娃母

The affect of service quality and online reviews on customer loyalty in the E...

jerlynmaetalle

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

1. Spark’s Role in the Big Data Ecosystem
Matei Zaharia

2. An Exciting Year for Spark
Very fast community growth
1.0 release in May
7+ distributors, 20+ apps

3. Project Activity (June 2013 → June 2014)
Total contributors: 68 → 255
Companies contributing: 17 → 50
Total lines of code: 63,000 → 175,000

5. Compared to Other Projects
[Figure: two bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark by commits (axis 0 to 1,400) and lines of code changed (axis 0 to 300,000); activity in past 6 months]

6. Compared to Other Projects
[Figure: same two bar charts of commits and lines of code changed; activity in past 6 months]
Spark is now the most active project in the Hadoop ecosystem

7. Compared to Other Projects
Spark is one of the top 3 most active projects at Apache
More active than “general” data processing projects like NumPy, matplotlib, SciKit-Learn

8. Continuing Growth
[Figure: contributors per month to Spark; source: ohloh.net]

9. Major new additions

10. Last Summit
Last Summit we said we’d focus on two things:
• Standard libraries
• Enterprise features
New libraries: Spark SQL, MLlib (machine learning), GraphX (graph processing)
Enterprise features: security, monitoring, HA

11. Spark SQL
Enables loading & querying structured data in Spark
From Hive:
    c = HiveContext(sc)
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()
From JSON:
    c.jsonFile("tweets.json").registerAsTable("tweets")
    c.sql("select text, user.name from tweets")
tweets.json:
    {"text": "hi", "user": {"name": "matei", "id": 123}}
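To make the JSON query concrete, here is a minimal plain-Python sketch of what `select text, user.name from tweets` projects out of records shaped like the tweets.json sample. It does not use Spark at all; the helper name `select_text_and_user_name` and the second sample record are assumptions made up for this illustration.

```python
import json

# Records in the shape of the tweets.json sample from the slide.
# The second record is invented for this illustration.
TWEETS_JSON = """
{"text": "hi", "user": {"name": "matei", "id": 123}}
{"text": "Spark 1.0 is out", "user": {"name": "alice", "id": 456}}
"""

def select_text_and_user_name(lines):
    """Plain-Python analogue of: select text, user.name from tweets."""
    rows = []
    for line in lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # user.name in the query reaches into the nested "user" object.
        rows.append((record["text"], record["user"]["name"]))
    return rows

result = select_text_and_user_name(TWEETS_JSON)
print(result)
```

The point of the slide is that Spark SQL infers this nested structure from the JSON file, so the dotted path `user.name` works without declaring a schema by hand.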

12. Spark SQL
Integrates closely with Spark’s language APIs
    c.registerFunction("hasSpark", lambda text: "Spark" in text)
    c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access
[Diagram labels: Hive, Parquet, JSON, Cassandra, …; SQL; Python, Scala, Java]
44 contributors in past year
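The idea of calling a host-language function from inside a SQL query can be tried without a Spark cluster. The sketch below uses Python's standard-library `sqlite3` module as a stand-in for Spark SQL (an assumption for illustration only, not the API shown on the slide); the table contents are made up.

```python
import sqlite3

def has_spark(text):
    # Same predicate as the slide's lambda: is "Spark" a substring of the text?
    # Returned as an int so SQLite can use it directly in a WHERE clause.
    return int("Spark" in text)

conn = sqlite3.connect(":memory:")
conn.execute("create table tweets (text text)")
conn.executemany(
    "insert into tweets values (?)",
    [("hi",), ("Spark 1.0 is out",), ("I love Spark SQL",)],  # made-up rows
)

# Register the Python function under the SQL name hasSpark (one argument).
conn.create_function("hasSpark", 1, has_spark)

matching = [
    row[0]
    for row in conn.execute(
        "select text from tweets where hasSpark(text) order by rowid"
    )
]
print(matching)
```

As on the slide, the query text stays ordinary SQL while the filtering logic lives in a regular Python function.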

13. Machine Learning Library (MLlib)
Standard library of machine learning algorithms
Now includes 15+ algorithms
• New in 1.0: decision trees, SVD, PCA, L-BFGS
• In development: non-negative matrix factorization, LDA, Lanczos, multiclass trees, ADMM
    points = context.sql("select latitude, longitude from tweets")
    model = KMeans.train(points, 10)
40 contributors in past year
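For readers unfamiliar with what a k-means call computes, here is a single-machine sketch of the textbook Lloyd iteration in plain Python. It is not MLlib's implementation (which is distributed and uses its own initialization); the function name `kmeans`, its parameters, and the sample coordinates are assumptions for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2
                + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if a cluster is empty
                centers[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centers

# Made-up (latitude, longitude) pairs forming two well-separated groups.
points = [(37.7, -122.4), (37.8, -122.3), (37.9, -122.5),
          (40.7, -74.0), (40.8, -73.9), (40.6, -74.1)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

With two clearly separated groups, each center settles at the mean of one group; the slide's call asks for 10 clusters over the tweet coordinates instead.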

14. Java 8 API
Enables concise programming in Java similar to Scala and Python
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);
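For comparison, the same map-then-reduce shape looks like this in plain Python. This is a local sketch over an in-memory list, not Spark code; the contents of `lines` stand in for data.txt and are made up.

```python
from functools import reduce

# Stand-in for the lines of data.txt; the contents are made up.
lines = ["spark", "summit", "2014"]

# Counterpart of lines.map(s -> s.length())
line_lengths = map(lambda s: len(s), lines)

# Counterpart of lineLengths.reduce((a, b) -> a + b)
total_length = reduce(lambda a, b: a + b, line_lengths)
print(total_length)
```

The Java 8 lambdas on the slide bring the Java API close to this level of brevity, which is the point of the comparison with Scala and Python.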

15. What is our vision for Spark?

16. 1. Unified Platform for Big Data Apps
[Diagram labels: Batch, Interactive, Streaming, …; Hadoop, Cassandra, Mesos, …; Cloud Providers, …]
Uniform API for diverse workloads over diverse storage systems and runtimes

17. Why a Platform Matters
Good for developers: one system to learn
Good for users: take apps anywhere
Good for distributors: more applications

18. 2. Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this
[Diagram labels: Python, Scala, Java, R; SQL, ML, graph, …; Core]
Much of future activity will be in these libraries

19. Databricks & Spark
At Databricks, we are working to keep Spark 100% open source and compatible across vendors
All our work on Spark is at Apache
Check out project-specific talks to see what’s next!

20. Thank You and Enjoy Spark Summit!
