Helsinki Spark Meetup Nov 20, 2015: Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc... (Chris Fregly)
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
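As a rough illustration of the probabilistic data structures mentioned in item 4, here is a minimal Count-Min Sketch in plain Python. It is a sketch for intuition only, not the implementation used in Spark, Redis, or Algebird; the class name, the `width` and `depth` defaults, and the sample events are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed depth x width grid of counters."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available estimate and never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


# Hypothetical stream of events, standing in for a high-volume ingest.
sketch = CountMinSketch()
for event in ["spark"] * 30 + ["kafka"] * 12 + ["cassandra"] * 3:
    sketch.add(event)
```

Memory stays constant no matter how many distinct items arrive, which is the property that helps keep a streaming job stable under load; the price is that `estimate` may overcount when items collide.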
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ... (Chris Fregly)
Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
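To make the "Euclidean vs. Non-Euclidean Similarity" agenda item concrete, the following plain-Python sketch compares a Euclidean distance with two measures that are commonly grouped as non-Euclidean (cosine and Jaccard). The function names and the sample rating vectors are assumptions for illustration, not material from the talk.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    # Compares direction only, so scaling a vector does not change the score.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    # Overlap between two sets: size of intersection over size of union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical rating vectors: bob rates the same items exactly twice as high.
alice = [5.0, 3.0, 0.0, 1.0]
bob = [10.0, 6.0, 0.0, 2.0]
```

Here `euclidean_distance(alice, bob)` is well above zero while `cosine_similarity(alice, bob)` is essentially 1, which shows why the choice of measure matters when users rate on different scales.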
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
Feature Talk: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Talk Abstract: Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components, including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, and Spark Streaming, SQL, ML, and GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Talk Agenda:
• Intro
• Live, Interactive Recommendations Demo
• Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird (advancedspark.com)
• Types of Similarity
• Euclidean vs. Non-Euclidean Similarity
• Jaccard Similarity
• Cosine Similarity
• LogLikelihood Similarity
• Edit Distance
• Text-based Similarities and Analytics
• Word2Vec
• LDA Topic Extraction
• TextRank
• Similarity-based Recommendations
• User-to-User
• Content-based, Item-to-Item (Amazon)
• Collaborative-based, User-to-Item (Netflix)
• Graph-based, Item-to-Item "Pathways" (Spotify)
• Aggregations, Approximations, and Similarities at Scale
• Twitter Algebird
• MinHash and Bucketing
• Locality Sensitive Hashing (LSH)
• BloomFilters
• CountMin Sketch
• HyperLogLog
• Q & A
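The "MinHash and Bucketing" and "Locality Sensitive Hashing (LSH)" items above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Algebird's API: the helper names, the 128-hash signature length, the 32-band split, and the sample user sets are all hypothetical.

```python
import hashlib


def _hash(token, seed):
    # One deterministic 64-bit hash per (seed, token) pair.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=128):
    # For each seeded hash function, keep the smallest hash over the set.
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching positions approximates the Jaccard similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_band_keys(signature, bands=32):
    # Split the signature into bands; each band becomes one bucket key.
    # Two sets are candidate neighbors if they share at least one key.
    rows = len(signature) // bands
    return {(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)}


# Hypothetical interest sets with a true Jaccard similarity of 3/5.
user_a = {"spark", "kafka", "cassandra", "docker"}
user_b = {"spark", "kafka", "cassandra", "zeppelin"}
sig_a = minhash_signature(user_a)
sig_b = minhash_signature(user_b)
```

Comparing fixed-length signatures instead of full sets is what makes similarity search tractable at scale, and the band keys let similar items be found through bucket lookups rather than all-pairs comparison.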
Speaker Bio: Chris Fregly is a Research Engineer @ Flux Capacitor AI in SF, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... (Chris Fregly)
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ... (Chris Fregly)
This talk highlights the Data Sources API, which participates in the Spark SQL DataFrame Catalyst Optimizer. We dive deep into the advanced open source Cassandra implementation at github.com/datastax/spark-cassandra-connector. We discuss data locality and cluster deployment, as well as the pros and cons of mixing OLAP and OLTP workloads.
We also implement a SimpleDataSource, which is a basic implementation of the Data Sources API.
All analysis is done with Apache Zeppelin.
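The idea behind a data source that cooperates with the optimizer can be shown with a toy, in-memory analogue in plain Python. This is not the talk's SimpleDataSource and not Spark's API; it is only loosely modeled on the shape of Spark's `PrunedFilteredScan.buildScan(requiredColumns, filters)`, and the class, the `(column, op, value)` filter tuples, and the sample rows are assumptions for illustration.

```python
import operator

# Comparison operators this toy source is able to evaluate by itself.
_OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


class SimpleDataSource:
    """Hypothetical in-memory source that accepts pushed-down work."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per record

    def build_scan(self, required_columns, filters):
        # Predicate pushdown: rows are filtered inside the source, so
        # non-matching records never reach the query engine.
        for row in self.rows:
            if all(_OPS[op](row[col], value) for col, op, value in filters):
                # Column pruning: only the requested columns are returned.
                yield {col: row[col] for col in required_columns}


source = SimpleDataSource([
    {"id": 1, "city": "Paris", "rsvps": 120},
    {"id": 2, "city": "Brussels", "rsvps": 80},
    {"id": 3, "city": "Helsinki", "rsvps": 95},
])
result = list(source.build_scan(["city"], [("rsvps", ">", 90)]))
```

The caller receives only the `city` column for the rows that pass the filter; in a real connector the same contract lets the storage system skip data instead of shipping it all to Spark.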
A verse by verse commentary on 1 Samuel 20 dealing with David and Jonathan as examples of ideal friendship. Saul becomes angry at Jonathan and throws a spear at him.
The term 'technical debt' and the challenges it can bring are becoming more widely understood and discussed by IT practitioners, vendor managers and business leaders. If you're looking at technical debt in your organization, or already thinking about measuring technical debt with your vendors, you will find this report useful.
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015 (Chris Fregly)
Zurich, Berlin, Vienna Spark Meetup Nov 02 2015
* Title *
Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep... (Athens Big Data)
Title: Real-Time Training and Deploying Spark ML Recommendations With Kafka and NetflixOSS
Speaker: Chris Fregly (https://linkedin.com/in/cfregly/)
Date: Monday, October 17, 2016
Event: https://meetup.com/Athens-Big-Data/events/234546355/
Powering Custom Apps at Facebook using Spark Script Transformation (Databricks)
Script Transformation is an important and growing use case for Apache Spark at Facebook. Spark’s script transforms allow users to run custom scripts and binaries directly from SQL and serve as an important means of stitching Facebook’s custom business logic together with existing data pipelines.
Along with Spark SQL + UDFs, a growing number of our custom pipelines leverage Spark’s script transform operator to run user-provided binaries for applications such as indexing, parallel training, and inference at scale. Spawning custom processes from the Spark executors introduces new challenges in production, including external resource allocation and management, structured data serialization, and external process monitoring.
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for process namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization.
Finally, we will conclude by sharing our results, lessons learned and future directions (e.g., transform pipelines resource over-subscription).
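For readers unfamiliar with script transforms, the sketch below shows what a user-provided script might look like in Python. It assumes the default delimited row format, where each row arrives on stdin as one tab-separated line and is returned the same way on stdout, as in a query of the form `SELECT TRANSFORM(id, text) USING 'python my_script.py' AS (id, text, length)`. The script name, the columns, and the transformation are hypothetical, and none of this is Facebook's code.

```python
import io


def transform_row(fields):
    # Hypothetical business logic: upper-case the text and append its length.
    record_id, text = fields
    return [record_id, text.upper(), str(len(text))]


def run(stdin, stdout):
    # Read one tab-delimited row per line and emit one transformed row per line.
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        stdout.write("\t".join(transform_row(fields)) + "\n")


# Simulate the executor feeding two rows to the script. A real script would
# instead call run(sys.stdin, sys.stdout) after importing sys.
out = io.StringIO()
run(io.StringIO("1\thello\n2\tspark sql\n"), out)
```

Because every value crosses the process boundary as text, serialization cost and type fidelity become real concerns, which is the motivation for the complex-type and Protocol Buffers work listed above.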
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ... (Anya Bida)
Abstract: Imagine we have Ada, our data science intern. Let's run through a very simple wordcount Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given a basic foundation of infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infrastructure failures gracefully.
Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018
https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
Bio: Anya Bida (https://www.linkedin.com/in/anyabida/)
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond (ScyllaDB)
Scylla enables rapid Machine Learning experimentation at Tubi. The current-generation personalization service, Ranking Service, ramps up experimentation by 5x, while Popper, the next-generation experimentation engine, will grow by 10x and beyond. We'll talk about what's so special about these services.
London Spark Meetup Project Tungsten Oct 12 2015 (Chris Fregly)
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
A fresh look at Spark 2 features and what happens under the hood. We examine the latest Apache Spark release with a critical eye: still loving it, but also criticizing it.
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk (Fwdays)
During this talk, you will get acquainted with a new product inside the Wix ecosystem — Wix Components Studio. It is a visual IDE for React Components that enables team members from all disciplines to easily access, validate and discuss their components on the same platform. We will review the building blocks needed to assemble a visual IDE and the technical challenges we are dealing with. Specifically, we will discuss pluggable architecture (and what that means), code analysis and generation, schema extraction, and mechanism for data synchronization in different environments. A short demo is also included!
Apache Spark 3.0: Overview of What’s New and Why Care (Databricks)
Continuing with the objective of making Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other initiatives that are coming in the future. In this talk, we want to share with the Bogota Spark community an overview of Spark 3.0 features and enhancements.
In particular, we will touch upon the following areas:
* Performance Improvement Features
* Improved Usability Features
* ANSI SQL Compliance
* Pandas UDFs
* Compatibility and migration considerations
* Spark Ecosystem: Delta Lake, Project Hydrogen, and Project Zen
Pandas on AWS - Let me count the ways.pdf (Chris Fregly)
Chris Fregly (Principal Solution Architect, AI and Machine Learning at AWS) will give a brief presentation on the various ways to run Pandas, Modin, and Ray at scale on AWS. He will then answer questions from the audience and from the moderator, Alejandro Herrera of Ponder.
Chris Fregly is a Principal Solution Architect for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is the organizer of the Global Data Science on AWS meetup. He is co-author of the O'Reilly Book, "Data Science on AWS."
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup (Chris Fregly)
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: Ray Overview, Ray AI Runtime on AWS using Amazon SageMaker, EC2, EMR, EKS by Chris Fregly, Principal Specialist Solution Architect, AI and Machine Learning @ AWS
Talk #2: Deep-dive Blueprints for Amazon Elastic Kubernetes Service (EKS) including Ray and Spark by Apoorva Kulkarni, Sr. Specialist Solution Architect, Containers and Kubernetes @ AWS
Zoom link: https://us02web.zoom.us/j/82308186562
Amazon reInvent 2020 Recap: AI and Machine Learning (Chris Fregly)
Video here: https://youtu.be/YSXe02Y5pHM
NEW RELEASE! Build, Automate, Manage, and Scale ML Workflows with the NEW Amazon SageMaker Pipelines by Hallie Crosby Weishahn.
Description of Talk and Demo
AWS recently announced Amazon SageMaker Pipelines (https://aws.amazon.com/sagemaker/pipelines/), the first purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for machine learning.
SageMaker Pipelines has three main components which improve the operational resilience and reproducibility of your workflows: 1) pipelines, 2) model registry, and 3) projects.
In this talk and demo, Hallie will walk us through the new Amazon SageMaker Pipelines feature including MLOps support.
Date/Time
9-10am US Pacific Time (Third Monday of Every Month)
RSVP: https://www.eventbrite.com/e/1-hr-free-workshop-pipelineai-gpu-tpu-spark-ml-tensorflow-ai-kubernetes-kafka-scikit-tickets-45852865154
Meetup:
https://www.meetup.com/Data-Science-on-AWS/
Zoom:
https://zoom.us/j/690414331
Webinar ID: 690 414 331
Phone:
+1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
Related Links
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
O'Reilly Book: https://datascienceonaws.com
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Support: https://support.pipeline.ai
Monthly Workshop: https://www.eventbrite.com/e/full-day-workshop-kubeflow-gpu-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-tickets-63362929227
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... (Chris Fregly)
Waking the Data Scientist at 2am:
Detect Model Degradation on Production Models with Amazon SageMaker Endpoints & Model Monitor
In this talk, I describe how to deploy a model into production and monitor its performance using SageMaker Model Monitor. With Model Monitor, I can detect if a model's predictive performance has degraded - and alert an on-call data scientist to take action and improve the model at 2am while the DevOps folks sleep soundly through the night.
Topics: AI and Machine Learning, Model Deployment, Anomaly Detection, Amazon SageMaker Endpoints, and Model Monitor
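The idea of detecting degradation and alerting someone can be sketched in a few lines. This is only an illustration of comparing live traffic against a training-time baseline; it does not use the SageMaker Model Monitor API, and the function names, threshold, and sample values are all assumptions made for the example.

```python
from statistics import mean, stdev

def build_baseline(values):
    """Summarize a metric (or feature) captured at training time."""
    return {"mean": mean(values), "stdev": stdev(values)}

def has_drifted(baseline, live_values, max_z=3.0):
    """Return True when the live mean sits more than max_z standard
    errors away from the baseline mean (a simple z-score check)."""
    std_err = baseline["stdev"] / (len(live_values) ** 0.5)
    if std_err == 0:
        return mean(live_values) != baseline["mean"]
    z = abs(mean(live_values) - baseline["mean"]) / std_err
    return z > max_z

# Hypothetical prediction scores: training-time baseline vs. two live windows.
baseline = build_baseline([0.48, 0.52, 0.50, 0.49, 0.51, 0.50])
healthy_window = [0.50, 0.49, 0.51, 0.50]
degraded_window = [0.70, 0.72, 0.69, 0.71]

if has_drifted(baseline, degraded_window):
    print("ALERT: page the on-call data scientist")
```

A production monitor would track many features and metrics on a schedule; the single z-score here only shows the shape of the check.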
Quantum Computing with Amazon Braket
In this talk, I describe some fundamental principles of quantum computing including qubits, superposition, and entanglement. I will demonstrate how to perform secure quantum computing tasks across many Quantum Processing Units (QPUs) using Amazon Braket, IAM, and S3.
AI and Machine Learning, Quantum Computing, Amazon Braket, QPU
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person (Chris Fregly)
In this talk, we present tips and best practices for scaling a large workshop for 1,000's of simultaneous attendees - both online and in-person. While our workshop is focused on AI and machine learning on AWS, we generalize our learnings for any domain or specialization.
Video: https://youtu.be/T0L0JxDaPkc
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, Airflow, and MLflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most widely used pipeline orchestration framework in machine learning and data engineering.
MLflow is a lightweight experiment-tracking system recently open-sourced by Databricks, the creators of Apache Spark. MLflow supports Python, Java/Scala, and R - and offers native support for TensorFlow, Keras, and Scikit-Learn.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
The link will be sent a few hours before the start of the workshop.
Only registered users will receive the link.
If you do not receive the link a few hours before the start of the workshop, please send your Eventbrite registration confirmation to support@pipeline.ai for help.
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set Up ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Run Multiple Experiments with MLflow Experiment Tracking
12. Reproduce Model Training with TFX Metadata Store
13. Deploy the Model to Production with TensorFlow Serving and Istio
14. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set Up ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
Speaker: Umayah Abdennabi
Agenda
* Intro Grammarly (Umayah Abdennabi, 5 mins)
* Meetup Updates and Announcements (Chris, 5 mins)
* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi
Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.
* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi
TensorFlow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model-building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open source Keras and tf.keras.
* SQuAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)
SQuAD (Stanford Question Answer Dataset) is an NLP challenge based around answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state of the art approaches in terms of time, complexity, and accuracy.
https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad
Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... (Chris Fregly)
Traditional machine learning pipelines end with lifeless models sitting on disk in the research lab. These traditional models are typically trained on stale, offline, historical batch data. Static models and stale data are not sufficient to power today's modern, AI-first enterprises that require continuous model training, continuous model optimizations, and lightning-fast model experiments directly in production. Through a series of open source, hands-on demos and exercises, we will use PipelineAI to breathe life into these models using 4 new techniques that we’ve pioneered:
* Continuous Validation (V)
* Continuous Optimizing (O)
* Continuous Training (T)
* Continuous Explainability (E).
The Continuous "VOTE" techniques have proven to maximize pipeline efficiency, minimize pipeline costs, and increase pipeline insight at every stage, from continuous model training (offline) to live model serving (online).
Attendees will learn to create continuous machine learning pipelines in production with PipelineAI, TensorFlow, and Kafka.
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... (Chris Fregly)
Perform Online Predictions using Slack
A/B and multi-armed bandit model compare
Train Online Models with Kafka Streams
Create new models quickly
Deploy to production safely
Mirror traffic to validate online performance
Any Framework, Any Hardware, Any Cloud
Dashboard to manage the lifecycle of models from local development to live production
Generates optimized runtimes for the models
Custom targeting rules, shadow mode, and percentage-based rollouts to safely test features in live production
Continuous model training, model validation, and pipeline optimization
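The "A/B and multi-armed bandit model compare" item above can be illustrated with a small epsilon-greedy router. This is a generic sketch of the technique, not PipelineAI's implementation; the class name, the model names, and the simulated success rates are hypothetical.

```python
import random

class EpsilonGreedyRouter:
    """Route each request to a model: explore at random with probability
    epsilon, otherwise exploit the model with the best observed reward."""

    def __init__(self, model_names, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.value = {name: 0.0 for name in model_names}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.value, key=self.value.get)

    def record(self, name, reward):
        # Incremental update of the running mean reward for this model.
        self.pulls[name] += 1
        self.value[name] += (reward - self.value[name]) / self.pulls[name]

# Hypothetical per-request success rates for two live model variants.
true_rates = {"model_a": 0.2, "model_b": 0.8}
router = EpsilonGreedyRouter(true_rates)
outcome_rng = random.Random(1)

for _ in range(2000):
    name = router.choose()
    reward = 1.0 if outcome_rng.random() < true_rates[name] else 0.0
    router.record(name, reward)

print(router.pulls)
```

Unlike a fixed 50/50 A/B split, the router shifts most traffic to the better-performing model while still sending a small share to the other one.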
https://youtu.be/zpkH9oiIovU
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/258276286/
Related Links
PipelineAI Home: https://pipeline.ai
PipelineAI Community Edition: https://community.pipeline.ai
PipelineAI GitHub: https://github.com/PipelineAI/pipeline
PipelineAI Quick Start: https://quickstart.pipeline.ai
Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
YouTube Videos: https://youtube.pipeline.ai
SlideShare Presentations: https://slideshare.pipeline.ai
Slack Support:
https://joinslack.pipeline.ai
Web Support and Knowledge Base: https://support.pipeline.ai
Email Support: help@pipeline.ai
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... (Chris Fregly)
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most hyperparameter optimizers stop at the training phase (e.g., learning rate, tree depth, EC2 instance type, etc.), we extend model validation and tuning into a new post-training optimization phase that includes 8-bit reduced-precision weight quantization and neural network layer fusing, among many other framework-specific and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
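As a rough illustration of the 8-bit reduced-precision weight quantization mentioned above, the following sketch maps floating-point weights onto signed 8-bit integers with a single scale factor and then restores them. It is a generic symmetric quantization example with made-up weights, not the procedure used by PipelineAI or any particular framework.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]

# Hypothetical layer weights.
weights = [0.82, -1.27, 0.003, 0.45, -0.6]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half of one quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

The trade-off is visible in the small weight 0.003, which rounds to zero: storage and compute shrink, at the cost of precision that has to be validated after training.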
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly Training and Video Series titled "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to... (Chris Fregly)
https://pipeline.ai
With PipelineAI, You Can…
* Generate Hardware-Specific Model Optimizations
* Deploy and Compare Models in Live Production
* Optimize Complete AI Pipeline Across Many Models
* Hyper-Parameter Tune Both Training & Predicting Phases
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern... (Chris Fregly)
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/244971261/
Based on this blog post: https://mengdong.github.io/2017/07/15/distributed-tensorflow-with-gpu-on-kubernetes-and-mapr/
YouTube video:
https://www.youtube.com/watch?v=3phz1_B-rR4
http://pipeline.ai
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -... (Chris Fregly)
Online Workshop
Note: A GPU-based cloud instance will be provided to each attendee for the duration of this event!!
At 8am PT on the morning of this workshop, we will email the Webinar details to your email address registered with Eventbrite.
If this email address is not up to date - or you do not get the email by 8am PT - please email your Eventbrite confirmation to help@pipeline.ai and we'll send you the details.
http://pipeline.ai
Title
PipelineAI Distributed Spark ML + Tensorflow AI + GPU Workshop
Time
Start: 9am PT
End: 1pm PT
Highlights
We will each build an end-to-end, continuous Tensorflow AI model training and deployment pipeline on our own GPU-based cloud instance.
At the end, we will combine our cloud instances to create the LARGEST Distributed Tensorflow AI Training and Serving Cluster in the WORLD!
Pre-requisites
Just a modern browser, internet connection, and a good night's sleep! We'll provide the rest.
Agenda
Spark ML
TensorFlow AI
Storing and Serving Models with HDFS
Trade-offs of CPU vs. GPU, Scale Up vs. Scale Out
CUDA + cuDNN GPU Development Overview
TensorFlow Model Checkpointing, Saving, Exporting, and Importing
Distributed TensorFlow AI Model Training (Distributed Tensorflow)
TensorFlow's Accelerated Linear Algebra Framework (XLA)
TensorFlow's Just-in-Time (JIT) Compiler, Ahead of Time (AOT) Compiler
Centralized Logging and Visualizing of Distributed TensorFlow Training (Tensorboard)
Distributed Tensorflow AI Model Serving/Predicting (TensorFlow Serving)
Centralized Logging and Metrics Collection (Prometheus, Grafana)
Continuous TensorFlow AI Model Deployment (TensorFlow, Airflow)
Hybrid Cross-Cloud and On-Premise Deployments (Kubernetes)
High-Performance and Fault-Tolerant Micro-services (NetflixOSS)
More Info including GitHub and Docker Repos
http://pipeline.ai
Understanding Globus Data Transfers with NetSage (Globus)
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Enhancing Project Management Efficiency: Leveraging AI Tools like ChatGPT (Jay Das)
With the advent of artificial intelligence (AI) tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT and Bard, organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
May Marketo Masterclass, London MUG May 22 2024 (Adele Miller)
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Globus Compute with IRI Workflows - GlobusWorld 2024 (Globus)
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of this work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is to use Globus Compute to replace the current method for managing tasks. We describe a brief proof of concept showing how Globus Compute could help schedule jobs and serve as a tool to connect compute at different facilities.
Large Language Models and the End of Programming (Matt Welsh)
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
An Enterprise Resource Planning (ERP) system includes various modules that reduce a business's workload. It also organizes workflows, which helps improve productivity. Here is a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
First Steps with Globus Compute Multi-User Endpoints (Globus)
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Paketo Buildpacks: the best way to build OCI images? DevopsDa... (Anthony Dahanne)
Buildpacks have been around for more than 10 years! At first, they were used to detect and build an application before deploying it to certain PaaS platforms. Later, their latest generation, Cloud Native Buildpacks (a CNCF incubating project), made it possible to create Docker (OCI) images. Are they a good alternative to the Dockerfile? What are Paketo buildpacks? Which communities support them, and how?
Come find out in this ignite session.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|... (informapgpstrackings)
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
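To make the contrast in this abstract concrete, here is a minimal sketch of event sourcing: state is never stored directly but rebuilt by replaying immutable events, which is what enables "time travel" to an earlier point. The shopping-cart domain, the event kinds, and the function names are assumptions for illustration and are not taken from Wix's system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An immutable record of one state change."""
    kind: str       # "item_added" or "item_removed"
    sku: str
    quantity: int

def apply(cart, event):
    """Return a new cart (sku -> quantity) with one event applied."""
    updated = dict(cart)
    delta = event.quantity if event.kind == "item_added" else -event.quantity
    updated[event.sku] = updated.get(event.sku, 0) + delta
    if updated[event.sku] <= 0:
        del updated[event.sku]
    return updated

def replay(events, upto=None):
    """Rebuild state from the log; pass upto to inspect an earlier point."""
    cart = {}
    for event in events[:upto]:
        cart = apply(cart, event)
    return cart

log = [
    Event("item_added", "mug", 2),
    Event("item_added", "pen", 1),
    Event("item_removed", "mug", 1),
]

current = replay(log)                 # state after all events
as_of_second_event = replay(log, 2)   # "time travel" to an earlier state
print(current, as_of_second_event)
```

A CRUD model would instead persist only the `current` dictionary and update it in place, which is simpler to reason about but discards the history that the event log keeps.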
Navigating the Metaverse: A Journey into Virtual Evolution (Donna Lenk)
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Developing Distributed High-performance Computing Capabilities of an Open Sci... (Globus)
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR (Tier1 app)
Although ‘java.lang.OutOfMemoryError’ appears on the surface to be a single error, there are actually 9 types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... (Juraj Vysvader)
In 2015, I wrote extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but they did have 63K downloads (possibly powering tens of thousands of websites).
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... (Globus)
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis operations to the large ESGF data archives, transferring only the resultant analysis (e.g., visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Helsinki Spark Meetup Nov 20 2015
1. IBM Spark
spark.tc
Spark After Dark 1.5
High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
November 20, 2015
2. Who Am I?
Power of data. Simplicity of design. Speed of innovation.
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Founder
Advanced Apache Spark Meetup
Author
Advanced Spark
Due 2016
My Ma’s First Time in California
3. Random Slide: More Ma “First Time” Pics
In California
Using Chopsticks
Using “New” iPhone
4. Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)
5. Advanced Apache Spark Meetup
Meetup Metrics
1600+ Members in just 4 mos!
Top 5 Most Active Spark Meetup!!
Meetup Goals
Dig deep into codebase of Spark and related projects
Study integrations of Cassandra, ElasticSearch,
Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
Surface and share patterns and idioms of these
well-designed, distributed, big data components
6. All Slides and Code Are Available!
advancedspark.com
slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor
7. What is “Spark After Dark”?
Spark-based, Advanced Analytics Reference App
End-to-End, Scalable, Real-time Big Data Pipeline
Demonstration of Spark & Related Big Data Projects
github.com/fluxcapacitor
8. Tools of This Talk
Kafka
Redis
Docker
Ganglia
Cassandra
Parquet, JSON, ORC, Avro
Apache Zeppelin Notebooks
Spark SQL, DataFrames, Hive
ElasticSearch, Logstash, Kibana
Spark ML, GraphX, Stanford CoreNLP
…
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor
9. Themes of this Talk
Filter
Off-Heap
Parallelize
Approximate
Find Similarity
Minimize Seeks
Maximize Scans
Customize for Workload
Tune Performance At Every Layer
Be Nice, Collaborate!
Like a Mom!!
10. Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
11. Spark Core: Tuning & Mechanical Sympathy
Understand and Acknowledge Mechanical Sympathy
Study AlphaSort and 100 TB GraySort Challenge
Dive Deep into Project Tungsten
12. Mechanical Sympathy
Hardware and software working together in harmony.
- Martin Thompson
http://mechanical-sympathy.blogspot.com
Whatever your data structure, my array will beat it.
- Scott Meyers
Every C++ Book, basically
Hair
Sympathy
- Bruce Jenner
13. Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+): Minimize Memory and GC, Maximize CPU Cache Locality
GraySort Challenge (Spark 1.1-1.2): Saturate Network I/O, Saturate Disk I/O
14. AlphaSort Technique: Sort 100 Bytes Recs
Naïve: List [Pointer]
Must dereference key for comparison (Ptr -> Key, Value)
AlphaSort: List [(Key, Pointer)]
Key is directly available for comparison. Dereference Not Required!
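The two layouts can be sketched roughly as follows. This is an illustration only: the object and method names are made up, the 7-byte prefix is an arbitrary choice that keeps the packed Long non-negative, and Scala tuples are boxed, so the sketch shows the comparison logic rather than the real memory layout.

```scala
object AlphaSortSketch {
  // Unsigned lexicographic comparison of the first `len` bytes of two records.
  def compareKeys(a: Array[Byte], b: Array[Byte], len: Int): Int = {
    var i = 0
    while (i < len) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    0
  }

  // Naive: sort pointers (indices); every comparison dereferences both records.
  def sortNaive(records: Array[Array[Byte]], keyLen: Int): Array[Int] =
    records.indices.toArray.sortWith((x, y) => compareKeys(records(x), records(y), keyLen) < 0)

  // First 7 key bytes packed big-endian into a non-negative Long prefix.
  def prefix(rec: Array[Byte]): Long = {
    var p = 0L
    var i = 0
    while (i < 7) { p = (p << 8) | (rec(i) & 0xFFL); i += 1 }
    p
  }

  // AlphaSort-style: sort (prefix, pointer) pairs; dereference only on prefix ties.
  def sortAlphaSort(records: Array[Array[Byte]], keyLen: Int): Array[Int] = {
    val entries = records.indices.map(i => (prefix(records(i)), i)).toArray
    entries.sortWith { (x, y) =>
      if (x._1 != y._1) x._1 < y._1
      else compareKeys(records(x._2), records(y._2), keyLen) < 0
    }.map(_._2)
  }
}
```

Both methods return the same ordering of record indices; only the number of record dereferences differs.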
15. CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: Not CPU Cache-line Friendly!
*4 bytes with Compressed OOPs
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU Cache-line Friendly!
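The arithmetic behind these three layouts can be checked with a small sketch. The 64-byte cache line is an assumption (typical for x86-64, not stated on the slide), the array is assumed to start on a line boundary, and the names are illustrative.

```scala
object CacheLineMath {
  val CacheLineBytes = 64 // assumption: typical x86-64 cache line size

  // Whole entries that fit in one cache line.
  def entriesPerLine(entryBytes: Int): Int = CacheLineBytes / entryBytes

  // True if some entry in a densely packed, line-aligned array spans two cache lines.
  def straddles(entryBytes: Int): Boolean = CacheLineBytes % entryBytes != 0
}
```

With these assumptions, 14-byte entries straddle line boundaries, 16-byte entries pack 4 per line, and 8-byte entries pack 8 per line, which is where the "2x" comes from.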
16. Performance Comparison
17. Similar Trick: Direct Cache Access (DCA)
Pull out packet header alongside pointer to payload
18. CPU Cache Line Sizes
My Laptop
My SoftLayer BareMetal
19. Cache Hits: Sequential v Random Access
20. Mechanical Sympathy
CPU Cache Lines and Matrix Multiplication
21. CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)
Bad: inner loop walks down a column of matB (a different row on every step),
not using CPU cache line,
ineffective pre-fetching
22. CPU Cache Friendly Matrix Multiplication
// Transpose B (matBT has numColsB rows and numRowsB columns)
for (i <- 0 until numRowsB)
  for (j <- 0 until numColsB)
    matBT(j)(i) = matB(i)(j)
// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
Good: Full CPU cache line,
effective prefetching
OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
Reference j before k
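A runnable version of both variants, as a minimal sketch: the `MatMul` object and its method names are illustrative, and matrices are plain `Array[Array[Double]]` with at least one row in B.

```scala
object MatMul {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), touching a different row array on every step.
  def multiplyNaive(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p; k <- 0 until m)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose B once so the inner loop scans two rows sequentially.
  def multiplyTransposed(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val bT = Array.ofDim[Double](p, m)
    for (i <- 0 until m; j <- 0 until p) bT(j)(i) = b(i)(j)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p; k <- 0 until m)
      res(i)(j) += a(i)(k) * bT(j)(k)
    res
  }
}
```

Both methods compute the same product; the transpose costs one extra pass over B, which pays off once the matrices are large enough that B no longer fits in cache.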
23. Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
24. Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
25. Results of Matrix Multiply Comparison
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend (JVM run with -XX:-Inline)
Differences annotated on the perf output: ~27x, ~13x, ~13x, ~2x, ~10x
55 hp vs. 550 hp
26. Mechanical Sympathy
CPU Cache Lines and Lock-Free Thread Sync
27. CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
  var tuple = (0,0)
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
    this.synchronized {
      tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
      tuple
    }
  }
}
28. CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)
object CacheNaiveCaseClassCounters {
  var tuple = new MyTuple(0,0)
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
    this.synchronized {
      tuple = new MyTuple(tuple.left + leftIncrement,
                          tuple.right + rightIncrement)
      tuple
    }
  }
}
29. CPU Cache Friendly Lock-Free Counters
import java.util.concurrent.atomic.AtomicLong
object CacheFriendlyLockFreeCounters {
  // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
  val tuple = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
    var originalLong = 0L
    var updatedLong = 0L
    do {
      originalLong = tuple.get()
      val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter
      val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
      val updatedRightInt = originalRightInt + rightIncrement // increment right counter
      val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
      updatedLong = updatedLeftInt.toLong << 32 // place the left counter in the upper 32 bits
      updatedLong |= updatedRightInt & 0xFFFFFFFFL // mask so a negative right counter cannot corrupt the left
    } while (!tuple.compareAndSet(originalLong, updatedLong))
    updatedLong
  }
}
Q: Why not @volatile long?
A: @volatile only makes each single read or write of the long atomic and visible.
The read-modify-write increment as a whole is still not atomic,
so compareAndSet is needed to detect concurrent updates.
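The packing and the compare-and-set retry loop can be exercised in isolation with a sketch like the one below. The `PackedCounters` object and its helper names are illustrative, not the demo code from the talk.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  private val packed = new AtomicLong(0L)

  // Left counter in the upper 32 bits, right counter in the lower 32 bits.
  def pack(left: Int, right: Int): Long = (left.toLong << 32) | (right & 0xFFFFFFFFL)
  def left(p: Long): Int = (p >>> 32).toInt
  def right(p: Long): Int = p.toInt

  // Retry until no other thread changed the word between get() and compareAndSet().
  def increment(leftInc: Int, rightInc: Int): Long = {
    var updated = 0L
    var done = false
    while (!done) {
      val original = packed.get()
      updated = pack(left(original) + leftInc, right(original) + rightInc)
      done = packed.compareAndSet(original, updated)
    }
    updated
  }

  def current: Long = packed.get()
}
```

Because both counters live in one 8-byte word, a single successful `compareAndSet` updates them together without a lock.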
30. Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
31. Results of Counters Comparison
Naïve Tuple Counters vs. Naïve Case Class Counters vs. Cache Friendly Lock-Free Counters
Differences annotated on the perf output: ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x
32. Profiling Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces (-XX:+PreserveFramePointer)
Plateaus are Bad!!
33. 100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers
34. Winning Results
Spark Goals
Saturate Network I/O
Saturate Disk I/O
(2013) (2014)
35. Winning Hardware Configuration
Compute
206 Workers, 1 Master (AWS EC2 i2.8xlarge)
32 virtual cores, Intel Xeon CPU E5-2670 @ 2.5 GHz
244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
3 GBps mixed read/write disk I/O per node
Network
AWS Placement Groups, VPC, Enhanced Networking
Single Root I/O Virtualization (SR-IOV)
10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
36. Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per cpu
206 nodes * 32 cores = 6592 cores
6592 cores * 4 = 26,368 partitions
6592 cores * 6 = 39,552 partitions
6592 cores * 4.25 ~= 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace
Required ~10s of sampling 79 keys from each partition
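The partition counts above follow directly from the cluster size. A small sketch of the arithmetic, with illustrative names:

```scala
object GraySortPartitions {
  val nodes = 206
  val coresPerNode = 32
  val totalCores: Int = nodes * coresPerNode

  // Partitions for a given partitions-per-core factor, rounded to a whole number.
  def partitions(perCore: Double): Long = math.round(totalCores * perCore)
}
```

A factor of 4.25 gives 28,016, which the slide rounds to 28,000.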
37. New Sort Shuffle Manager for Spark 1.2
Original “hash-based”
New “sort-based”
① Use less OS resources (socket buffers, file descriptors)
② TimSort partitions in-memory
③ MergeSort partitions on-disk into a single master file
④ Serve partitions from master file: seek once, sequential scan
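Steps ③ and ④ can be pictured with a toy in-memory model: one "file" holding all partitions back to back, plus an offset index. This is a simplified sketch with made-up names, not Spark's shuffle writer, and a byte array stands in for the on-disk file.

```scala
object SingleFileShuffleSketch {
  // Mapper side: sort records by partition id, concatenate them into one
  // "file" and record where each partition starts.
  def write(records: Seq[(Int, Array[Byte])], numPartitions: Int): (Array[Byte], Array[Int]) = {
    val sorted = records.sortBy(_._1) // stable sort by partition id
    val offsets = new Array[Int](numPartitions + 1)
    for ((p, bytes) <- sorted) offsets(p + 1) += bytes.length
    for (p <- 1 to numPartitions) offsets(p) += offsets(p - 1) // prefix sums
    (sorted.flatMap(_._2.toSeq).toArray, offsets)
  }

  // Reducer side: one "seek" to offsets(p), then one sequential scan.
  def read(file: Array[Byte], offsets: Array[Int], p: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(file, offsets(p), offsets(p + 1))
}
```

Each mapper ends up with a single data file regardless of the number of reducers, which is what keeps file descriptors and socket buffers under control.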
38. Asynchronous Network Module
Switch to asynchronous Netty vs. synchronous java.nio
Switch to zero-copy epoll
Use only kernel-space between disk and network controllers
Custom memory management
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.preferDirectBuffers=true
Reuse off-heap buffers
spark.shuffle.io.numConnectionsPerPeer=8 (for example)
Increase to saturate hosts with multiple disks (8x800 SSD)
Details in SPARK-2468
39. Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]
Based on JDK 1.7 TimSort
Performs best with partially-sorted runs
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
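The AppendOnlyMap idea (open addressing, quadratic probing, keys and values side by side in one array) can be sketched as below. This is a simplified illustration under stated assumptions: fixed power-of-two capacity, no growth, String keys and Int sums only, and all names are hypothetical rather than Spark's.

```scala
object AppendOnlyMapSketch {
  final class IntSumMap(capacity: Int) {
    require(capacity > 1 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    private val mask = capacity - 1
    // Keys and values interleaved in one array: [k0, v0, k1, v1, ...]
    private val data = new Array[AnyRef](2 * capacity)
    private var used = 0

    // Slot holding `key`, or the first empty slot on its probe sequence.
    private def slotFor(key: String): Int = {
      var pos = key.hashCode & mask
      var step = 1
      while (data(2 * pos) != null && data(2 * pos) != key) {
        pos = (pos + step) & mask // quadratic probing: offsets 1, 3, 6, 10, ...
        step += 1
      }
      pos
    }

    // Insert the key or merge into its existing value; keys are never removed.
    def changeValue(key: String, delta: Int): Int = {
      val pos = slotFor(key)
      val updated =
        if (data(2 * pos) == null) {
          require(used < capacity - 1, "full: this sketch does not grow")
          data(2 * pos) = key
          used += 1
          delta
        } else data(2 * pos + 1).asInstanceOf[Integer].intValue + delta
      data(2 * pos + 1) = Int.box(updated)
      updated
    }

    def get(key: String): Option[Int] = {
      val pos = slotFor(key)
      if (data(2 * pos) == null) None
      else Some(data(2 * pos + 1).asInstanceOf[Integer].intValue)
    }
  }
}
```

Keeping one slot always empty guarantees that every probe sequence terminates.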
40. Daytona GraySort Challenge Goal Success
1.1 GB/s per node network I/O (Reducers)
Theoretical max = 1.25 GB/s for 10 Gbps Ethernet
3 GB/s per node disk I/O (Mappers)
Aggregate Cluster Network I/O!
220 GB/s / 206 nodes ~= 1.1 GB/s per node
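The unit conversion is easy to get wrong, so here is a small sketch of it. It assumes the link speed is quoted in gigabits per second and the measured throughput in gigabytes per second; the names are illustrative.

```scala
object NetworkMath {
  // Gigabits per second to gigabytes per second: 8 bits per byte.
  def linkGigabytesPerSec(gigabitsPerSec: Double): Double = gigabitsPerSec / 8.0

  // Average per-node share of an aggregate cluster throughput.
  def perNode(aggregate: Double, nodes: Int): Double = aggregate / nodes
}
```

A 10 Gbps link tops out at 1.25 gigabytes per second, so roughly 1.07 gigabytes per second per node is close to saturation.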
41. Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)
spark.shuffle.consolidateFiles (Mapper)
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer (Reducer)
Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors
Minimizes intermediate files and overall shuffle
More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
Many Threads (1 per CPU)
42. Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
SPARK-7076 (Spark 1.4)
43. Quick Review of Project Tungsten Jiras
SPARK-7076 (Spark 1.4)
44. Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
Partitioning, pruning, and predicate pushdowns
Binary, compressed, columnar file formats (Parquet)
45. Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
hash (Deprecated)
< 10,000 reducers
Output partition file hashes the key of (K,V) pair
Mapper creates an output file per partition
Leads to M*P output files for all partitions
sort (GraySort Challenge)
> 10,000 reducers
Default from Spark 1.2-1.5
Mapper creates single output file for all partitions
Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
Uses custom data structures and algorithms for sort-shuffle workload
Wins Daytona GraySort Challenge
tungsten-sort (Project Tungsten)
Default since 1.5
Modification of existing sort-based shuffle
Uses sun.misc.Unsafe for self-managed memory and garbage collection
Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
Perform joins, sorts, and other operators on both serialized and compressed byte buffers
46. CPU & Memory Optimizations
Custom Managed Memory
Reduces GC overhead
Both on and off heap
Exact size calculations
Direct Binary Processing
Operate on serialized/compressed arrays
Kryo can reorder/sort serialized records
LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms
o.a.s.sql.catalyst.expressions.UnsafeRow
o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)
Generate source code from overall query plan
100+ UDFs converted to use code generation
UnsafeFixedWidthAggregationMap
TungstenAggregationIterator
CodeGenerator
GenerateUnsafeRowJoiner
UnsafeSortDataFormat
UnsafeShuffleSortDataFormat
PackedRecordPointer
UnsafeRow
UnsafeInMemorySorter
UnsafeExternalSorter
UnsafeShuffleWriter
Mostly Same Join Code, UnsafeProjection
UnsafeShuffleManager
UnsafeShuffleInMemorySorter
UnsafeShuffleExternalSorter
Details in SPARK-7075
47. sun.misc.Unsafe
Info
addressSize()
pageSize()
Objects
allocateInstance()
objectFieldOffset()
Classes
staticFieldOffset()
defineClass()
defineAnonymousClass()
ensureClassInitialized()
Synchronization
monitorEnter()
tryMonitorEnter()
monitorExit()
compareAndSwapInt()
putOrderedInt()
Arrays
arrayBaseOffset()
arrayIndexScale()
Memory
allocateMemory()
copyMemory()
freeMemory()
getAddress() – not guaranteed after GC
getInt()/putInt()
getBoolean()/putBoolean()
getByte()/putByte()
getShort()/putShort()
getLong()/putLong()
getFloat()/putFloat()
getDouble()/putDouble()
getObjectVolatile()/putObjectVolatile()
Used by Tungsten
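A few of the memory methods listed above can be tried with a sketch like this one. It assumes a JDK that still ships `sun.misc.Unsafe` and allows the reflective lookup of its `theUnsafe` field; `UnsafeSketch` and `roundTrip` are illustrative names.

```scala
import sun.misc.Unsafe

object UnsafeSketch {
  // The singleton is only reachable reflectively from application code.
  val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // Write two longs into an off-heap block, read them back, then free it.
  def roundTrip(a: Long, b: Long): (Long, Long) = {
    val address = unsafe.allocateMemory(16)
    try {
      unsafe.putLong(address, a)
      unsafe.putLong(address + 8, b)
      (unsafe.getLong(address), unsafe.getLong(address + 8))
    } finally unsafe.freeMemory(address)
  }
}
```

The block is invisible to the garbage collector, so forgetting `freeMemory` leaks native memory.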
49. Traditional Java Object Row Layout
4-byte String
Multi-field Object
50. Custom Data Structures for Workload
UnsafeRow (Dense Binary Row): Dense, 8-bytes per field (word-aligned)
TaskMemoryManager (Virtual Memory Address): OS-Style Memory Paging
BytesToBytesMap (Dense Binary HashMap): AlphaSort-Style (Key + Pointer)
51. UnsafeRow Layout Example
Pre-Tungsten
Tungsten
52. Custom Memory Management
o.a.s.memory.TaskMemoryManager & MemoryConsumer
Memory management: virtual memory allocation, paging
Off-heap: direct 64-bit address
On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.PackedRecordPointer
64-bit word
(24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.UTF8String
Primitive Array[Byte]
2^13 pages * 2^27 bytes per page = 2^40 bytes = 1 TB RAM per Task
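The 24/13/27-bit layout of the 64-bit word can be sketched with plain shifts and masks. The field order (partition in the top bits, then page number, then offset) follows the slide; the object and method names are illustrative, not Spark's.

```scala
object PackedPointerSketch {
  val OffsetBits = 27
  val PageBits = 13
  val PartitionBits = 24

  def pack(partition: Int, page: Int, offset: Long): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits))
    require(page >= 0 && page < (1 << PageBits))
    require(offset >= 0 && offset < (1L << OffsetBits))
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & ((1 << PageBits) - 1)).toInt
  def offset(p: Long): Long = p & ((1L << OffsetBits) - 1)

  // 2^13 pages * 2^27 bytes per page = 2^40 addressable bytes.
  val addressableBytes: Long = 1L << (PageBits + OffsetBits)
}
```

With the partition id in the most significant bits, ordering the packed words by their top 24 bits groups records by partition without touching the records themselves.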
53. UnsafeFixedWidthAggregationMap
Aggregations
o.a.s.sql.execution.UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap
In-place updates of serialized data
No object creation on hot-path
Improved external agg support
No OOM’s for large, single key aggs
o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 Steps: hash-based agg (grouping), then sort-based agg
Supports spilling and external merge sorting
54. Equality
Bitwise comparison on UnsafeRow
No need to calculate equals(), hashCode()
Row 1 Equals Row 2!
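Comparing dense binary rows byte for byte can be sketched as follows. The 8-bytes-per-field encoding here is a simplification with illustrative names; it is not UnsafeRow's actual format, which also carries a null bitmap and variable-length data.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object BinaryRowEquality {
  // Encode a row of longs as 8 bytes per field.
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Rows are equal exactly when their bytes are equal; no per-field equals() or hashCode().
  def rowsEqual(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
}
```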
55. Joins
Surprisingly, not many code changes
o.a.s.sql.catalyst.expressions.UnsafeProjection
Converts InternalRow to UnsafeRow
56. Sorting
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
UnsafeInMemorySorter
UnsafeExternalSorter
RecordPointerAndKeyPrefix
UnsafeShuffleWriter
AlphaSort-Style Cache Friendly: Key-Prefix + Ptr, 2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
Supports merging compressed records if compression CODEC supports it (LZF)
57. Spilling
Efficient Spilling
Exact data size is known
No need to maintain heuristics & approximations
Controls amount of spilling
Spill merge on compressed, binary records!
If compression CODEC supports it
UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Exact Peak Memory for Spark Jobs
58. Code Generation
Problem
Boxing causes excessive object creation
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
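The contrast between interpreting an expression tree per row and generating flat source code can be shown with a toy model. Everything here is hypothetical: the tiny `Expr` hierarchy and its `genCode` method only borrow the name from Spark, and the generated Java source is returned as a string rather than compiled, since Janino is not assumed to be on the classpath.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Int]): Int // interpreted: one virtual call per node per row
    def genCode: String            // generated: a flat Java expression
  }
  case class Col(i: Int) extends Expr {
    def eval(row: Array[Int]): Int = row(i)
    def genCode: String = s"row[$i]"
  }
  case class Lit(v: Int) extends Expr {
    def eval(row: Array[Int]): Int = v
    def genCode: String = v.toString
  }
  case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Int]): Int = l.eval(row) + r.eval(row)
    def genCode: String = s"(${l.genCode} + ${r.genCode})"
  }

  // Source for one method covering the whole expression, ready for a compiler such as Janino.
  def genMethod(e: Expr): String = s"public int apply(int[] row) { return ${e.genCode}; }"
}
```

The generated method works on primitive ints with no tree traversal, which is the boxing and virtual-call overhead the slide describes.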
59. Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in SPARK-8159, SPARK-9571
Each Implements Expression.genCode()!
60. Creating a Custom UDF with Codegen
Study existing implementations
https://github.com/apache/spark/pull/7214/files
Extend base trait
o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function
o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)
o.a.s.sql.functions.scala
Don’t forget about Python!
python.pyspark.sql.functions.py
61. Who Benefits from Project Tungsten?
Users of DataFrames
All Spark SQL Queries
Catalyst
All RDDs
Serialization, Compression, and Aggregations
62. Project Tungsten Performance Results
Query Time
Garbage Collection
OOM’d on Large Dataset!
63. Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
64. Spark SQL: Query Optimizing & Catalyst
Explore DataFrames/Datasets/DataSources, Catalyst
Review Partitions, Pruning, Pushdowns, File Formats
Create a Custom DataSource API Implementation
65. DataFrames
Inspired by R and Pandas DataFrames
Schema-aware
Cross language support
SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
Generates JVM bytecode vs serializing to Python
DataFrame is container for logical plan
Lazy transformations represented as tree
Only logical plan is sent from Python -> JVM
Only results returned from JVM -> Python
UDF and UDAF Support
Custom UDF support using registerFunction()
Experimental UDAF support (e.g. HyperLogLog)
Supports existing Hive metastore if available
Small, file-based Hive metastore created if not available
*DataFrame.rdd returns underlying RDD if needed
Use DataFrames instead of RDDs!!
66. Spark and Hive
Early days, Shark was “Hive on Spark”
Hive Optimizer slowly replaced with Catalyst
Always use HiveContext – even if not using Hive!
If no Hive, a small Hive metastore file is created
Spark 1.5+ supports all Hive versions 0.12+
Separate classloaders for isolation
Breaks dependency between Spark internal Hive version and User’s external Hive version
67. Catalyst Optimizer
Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)
Implements o.a.s.sql.catalyst.rules.Rule
Apply to any plan stage
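The flavor of a rule such as constant folding can be shown on a toy expression tree. This sketch deliberately does not use Catalyst: `Literal`, `Attribute`, `Add`, and `fold` are stand-ins defined here, chosen only to mirror how a rule pattern-matches on case classes and rewrites the tree.

```scala
object ConstantFoldingSketch {
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Bottom-up rewrite: fold the children first, then this node.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }
}
```

Subtrees made only of literals collapse to a single literal, while anything that depends on an attribute is left in place.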
68. DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)
RunnableCommand (trait/interface): Common commands like EXPLAIN
ExplainCommand(impl: case class)
CacheTableCommand(impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all predicates/filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl)
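What a pruned, filtered scan does can be sketched without Spark. The filter case classes below reuse names from the slide but are toy definitions, rows are plain Maps, and `buildScan` here takes the data as an argument; the real interface lives in Spark and returns an RDD.

```scala
object PrunedFilteredScanSketch {
  sealed trait Filter
  case class EqualTo(attribute: String, value: Any) extends Filter
  case class GreaterThan(attribute: String, value: Int) extends Filter

  type Row = Map[String, Any]

  def matches(row: Row, f: Filter): Boolean = f match {
    case EqualTo(a, v)     => row.get(a).contains(v)
    case GreaterThan(a, v) => row.get(a).exists { case i: Int => i > v; case _ => false }
  }

  // Apply the pushed-down filters at the source, then return only the requested columns.
  def buildScan(data: Seq[Row], requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] =
    data.filter(row => filters.forall(f => matches(row, f)))
      .map(row => requiredColumns.map(c => row(c)))
}
```

Rows that fail a predicate and columns that were not requested never leave the source, which is the point of pushdown and pruning.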
69. Native Spark SQL DataSources
70. Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
71. Query Plan Visualization & Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs
72. JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, with the json() convenience method --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
73. JDBC Data Source
Add Driver to Spark JVM System Classpath
$ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
"url" -> "jdbc:postgresql://hostname:port/database",
"dbtable" -> "schema.tablename")
sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
CREATE TABLE genders USING jdbc
OPTIONS (url, dbtable, driver, …)
74. Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=true
spark.sql.parquet.cacheMetadata=true
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.parquet")
75. ORC Data Source
Configuration
spark.sql.orc.filterPushdown=true
DataFrames
val gendersDF = sqlContext.read.format("orc")
.load("file:/root/pipeline/datasets/dating/genders")
gendersDF.write.format("orc").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders")
SQL
CREATE TABLE genders USING orc
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders")
76. Third-Party Spark SQL DataSources
spark-packages.org
77. CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven
com.databricks:spark-csv_2.10:1.2.0
Code
val gendersCsvDF = sqlContext.read
.format("com.databricks.spark.csv")
.load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
.toDF("id", "gender")
Note: toDF() is required if the CSV file does not contain a header row
78. ElasticSearch DataSource (Elastic.co)
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
"es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
.options(esConfig).save("<index>/<document-type>")
79. Elasticsearch Tips
Set the id field to not_analyzed to skip analysis (tokenization)
Use term filter to build and cache the query
Perform multiple aggregations in a single request
Adapt scoring function to current trends at query time
80. AWS Redshift Data Source (Databricks)
Github
https://github.com/databricks/spark-redshift
Maven
com.databricks:spark-redshift_2.10:0.5.0
Code
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
.option("query", "select x, count(*) from my_table group by x")
.option("tempdir", "s3n://tmpdir")
.load(...)
Note: UNLOAD and copy to tmp bucket in S3 enables parallel reads
81. DB2 and BigSQL DataSources (IBM)
Coming Soon!
82. Cassandra DataSource (DataStax)
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Append)
.options(Map("keyspace"->"<keyspace>",
"table"->"<table>")).save(…)
83. Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
Pushdown Predicate Rules
1. Only push down non-partition-key column predicates with =, >, <, >=, or <=.
2. Only push down primary key column predicates with = or IN.
3. If there are regular columns in the pushdown predicates, they must have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including IN), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate.
6. No predicates are pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates for the same column cannot be pushed down if any of them is an equality or IN predicate.
84. New Cassandra DataSource
Bypass CQL, which is optimized for transactional data
Instead, do bulk reads/writes directly on SSTables
Similar to the 5-year-old Netflix open source project Aegisthus
Promotes Cassandra to first-class Analytics Option
Potentially only part of DataStax Enterprise?!
Please mail a nasty letter to your local DataStax office
85. Rumor of REST DataSource (Databricks)
Coming Soon?
Ask Michael Armbrust
Spark SQL Lead @ Databricks
86. Custom DataSource (Me and You!)
Coming Right Now!
DEMO ALERT!!
87. Create a Custom DataSource
Study Existing Native & Third-Party Data Sources
Native
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
Third-Party
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
88. Demo!
Create a Custom DataSource
89. Contribute a Custom Data Source
spark-packages.org
Managed by Databricks
Contains links to external github projects
Ratings and comments
Declare Spark version support for each package
Examples
https://github.com/databricks/spark-csv
https://github.com/databricks/spark-avro
https://github.com/databricks/spark-redshift
90. Parquet Columnar File Format
Based on Google Dremel
Collaboration between Twitter and Cloudera
Self-describing, evolving schema
Fast columnar aggregation
Supports filter pushdowns
Columnar storage format
Excellent compression
91. Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
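The three encodings above can be sketched in a few lines each. This is a minimal illustration on in-memory sequences, not Parquet's actual on-disk encoders; the object and method names are invented for the example.

```scala
object Encodings {
  // Run-length encoding: collapse each run of repeated values into (value, runLength).
  def runLength[A](xs: Seq[A]): List[(A, Int)] =
    xs.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse

  // Dictionary encoding: store each distinct value once, then replace values by small integer codes.
  def dictionary[A](xs: Seq[A]): (Vector[A], Seq[Int]) = {
    val dict  = xs.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, xs.map(index))
  }

  // Delta encoding: keep the first value, then only the difference to the previous value.
  // Works best on sorted data, where the differences are small.
  def delta(xs: Seq[Long]): Seq[Long] =
    if (xs.isEmpty) xs else xs.head +: xs.zip(xs.tail).map { case (a, b) => b - a }
}
```

A column such as gender, with long runs drawn from a fixed set of values, benefits from the first two; a sorted id column benefits from the third.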
92. Demo!
Demonstrate File Formats, Partition Schemes, and Query Plans
93. Hive JDBC/ODBC ThriftServer
Allow BI Tools to Query and Process Spark Data
Register Permanent Table
CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT)
USING org.apache.spark.sql.json
OPTIONS (path "datasets/dating/ratings.json.bz2")
Register Temp Table
ratingsDF.registerTempTable("ratings_temp")
Configuration
spark.sql.thriftServer.incrementalCollect=true
spark.driver.maxResultSize=10g or more (default: 1g)
94. Demo!
Query and Process Spark Data from BI Tools
95. Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
96. Spark Streaming: Scaling & Approximations
Discuss Delivery Guarantees, Parallelism, and Stability
Compare Receiver and Receiver-less Impls
Demonstrate Stream Approximations
97. Non-Parallel Receiver Implementation
98. Receiver Implementation (Kinesis)
KinesisRDD partitions store relevant offsets
Single receiver required to see all data/offsets
Kinesis offsets not deterministic like Kafka
Partitions rebuild from Kinesis using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
99. Parallel Receiver-less Implementation (Kafka)
100. Receiver-less Implementation (Kafka)
KafkaRDD partitions store relevant offsets
Each partition acts as a Receiver
Tasks/Executors pull from Kafka in parallel
Partitions rebuild from Kafka using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
101. Maintain Stability of Stream Processing
Rate Limiting
Since Spark 1.2
Fixed limit on number of messages per second
Potential to drop messages on the floor
Back Pressure
Since Spark 1.5 (Typesafe contribution)
More dynamic than rate limiting
Push back on reliable, buffered source (Kafka, Kinesis)
Fundamentals of Control Theory and Observability
102. Streaming Approximations
HyperLogLog and CountMin Sketch
103. HyperLogLog (HLL) Approx Distinct Count
Approximate count distinct
Twitter’s Algebird
Better than HashSet
Low, fixed memory
Only 1.5K, 2% error, 10^9 counts (tunable)
Redis HLL: 12K per key, 0.81% error, 2^64 counts
Spark’s countApproxDistinctByKey()
Streaming example in Spark codebase
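To make the "low, fixed memory" point concrete, here is a minimal HyperLogLog sketch. It is not Algebird's, Redis's, or Spark's implementation; the class name, the one-byte-per-register layout, and the choice of a 32-bit MurmurHash3 are assumptions made for illustration. With 2^p registers the standard error is roughly 1.04 / sqrt(2^p), so p = 12 gives about 1.6%.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog: 2^p one-byte registers, regardless of how many items are added.
class HyperLogLog(p: Int = 12) {
  require(p >= 7 && p <= 16, "p must be between 7 and 16")
  private val m = 1 << p
  private val registers = new Array[Byte](m)
  // Bias-correction constant, valid for m >= 128
  private val alpha = 0.7213 / (1.0 + 1.079 / m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)   // first p bits select a register
    val rest = h << p          // remaining 32 - p bits, left-aligned
    // rank = 1-based position of the leftmost 1-bit in the remaining bits
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while registers are still empty
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

Adding the same item twice never changes a register, which is why duplicates are free and the sketch counts distinct items only.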
http://research.neustar.biz/
104. CountMin Sketch (CMS) Approx Count
Approximate count
Twitter’s Algebird
Better than HashMap
Low, fixed memory
Known error bounds
Large number of counters
Streaming example in Spark codebase
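A Count-Min Sketch can be illustrated in a similar way. The class below is a hypothetical, minimal version (not Algebird's API): depth and width are assumed defaults, and each row's hash function is simply MurmurHash3 seeded with the row number.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: depth rows of width counters, fixed memory.
class CountMinSketch(depth: Int = 5, width: Int = 1024) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by seeding MurmurHash3 with the row number.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width   // non-negative column index
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions can only inflate a counter, so the minimum across rows is the tightest estimate.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

The estimate never undercounts; it can only overcount when every row collides with other items, which is where the known error bounds come from.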
105. Demo!
Using HLL and CMS for Streaming Count Approximations
106. Monte Carlo Simulations
From Manhattan Project (Atomic bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in Spark codebase
1 Argument: # of trials
Pi ~= (# red dots / # total dots) * 4
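The estimate can be sketched without Spark as follows. This is a single-threaded illustration in the spirit of the SparkPi example, not its actual code; the object name and the fixed seed are assumptions. It samples the unit square and counts points inside the quarter circle, which gives the same pi/4 ratio.

```scala
import scala.util.Random

object MonteCarloPi {
  // Throw `trials` random darts at the unit square and count those inside the quarter circle.
  def estimate(trials: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed)
    val inside = (1 to trials).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    // inside / trials approximates pi / 4
    4.0 * inside / trials
  }
}
```

By the Law of Large Numbers the error shrinks as the number of trials grows, roughly with 1 / sqrt(trials).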
107. Demo!
Using a Monte Carlo Simulation to Estimate Pi
108. Streaming Best Practices
Get Data Out of Streaming ASAP
Processing interval may exceed batch interval
Leads to unstable streaming system
Please Don’t…
Use updateStateByKey() like an in-memory DB
Put streaming jobs on the request/response hot path
Use Separate Jobs for Different Batch Intervals
Small Batch Interval: Store raw data (Redis, Cassandra, etc)
Medium Batch Interval: Transform, join, process data
High Batch Interval: Model training
Gotchas
Tune streamingContext.remember()
Use Approximations!!
109. Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
110. Spark ML: Featurizing & Recommendations
Understand Similarity and Dimension Reduction
Demonstrate Sampling and Bucketing
Generate Recommendations
111. Live, Interactive Demo!
sparkafterdark.com
112. Audience Participation Needed!!
Audience Instructions
Navigate to sparkafterdark.com
Click 3 actresses and 3 actors
Wait for us to analyze together!
Note: This is totally anonymous!!
Project Links
https://github.com/fluxcapacitor/pipeline
https://hub.docker.com/r/fluxcapacitor
113. Similarity
114. Types of Similarity
Euclidean
Linear-based measure
Suffers from Magnitude bias
Cosine
Angle-based measure
Adjusts for magnitude bias
Jaccard
Set intersection / union
Suffers from popularity bias
Log Likelihood
Netflix “Shawshank” Problem
Adjusts for popularity bias
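The first three measures are short enough to write out directly. The sketch below uses plain Scala collections and invented names; it leaves out log likelihood, and the sample values in the tests are made up for illustration.

```scala
object Similarity {
  // Euclidean distance: straight-line distance, so it grows with vector magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine similarity: compares the angle between vectors, ignoring their magnitude.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Jaccard similarity: size of the intersection divided by size of the union.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 0.0 else (a intersect b).size.toDouble / (a union b).size
}
```

Two vectors pointing in the same direction, such as (1, 1, 0) and (2, 2, 0), have a non-zero Euclidean distance but a cosine similarity of 1, which is the magnitude bias the slide refers to.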
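[Table: binary like matrix with rows Kimberly, Leslie, Meredith, Lisa, Holden and columns Ali, Matei, Reynold, Patrick, Andy; the rows contain 4, 2, 3, 3, and 5 ones respectively, but the column positions were lost in extraction]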
115. All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (e.g., 0)
Principal Component Analysis
Dimension reduction!!
116. Dimension Reduction
Sampling and Bucketing
117. Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square using MapReduce”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain vs. Cosine Similarity
118. Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Parallel compare bucket contents
O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
e.g., 500k x 500k matrix
O(1.25e17) -> O(1.25e13); b=50
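The bucketing idea can be sketched with MinHash signatures and banding. The slide does not name a specific similarity hash, so MinHash (which approximates Jaccard similarity) is an assumption here, as are all names and parameters; this is not the code of the linked spark-hash project. Input sets are assumed to be non-empty.

```scala
import scala.util.hashing.MurmurHash3

object MinHashLsh {
  // MinHash signature: for each seeded hash function, keep the minimum hash over the set's items.
  def signature(items: Set[String], numHashes: Int): Vector[Int] =
    Vector.tabulate(numHashes)(seed => items.map(MurmurHash3.stringHash(_, seed)).min)

  // Banding: cut the signature into bands of rowsPerBand values; each (bandIndex, band) is a bucket key.
  def bucketKeys(sig: Vector[Int], rowsPerBand: Int): Vector[(Int, Vector[Int])] =
    sig.grouped(rowsPerBand).zipWithIndex.map { case (band, i) => (i, band) }.toVector

  // Only ids that share at least one bucket become candidates for a full comparison.
  def candidatePairs(sets: Map[String, Set[String]],
                     numHashes: Int,
                     rowsPerBand: Int): Set[(String, String)] = {
    val keyed = for {
      (id, items) <- sets.toSeq
      key <- bucketKeys(signature(items, numHashes), rowsPerBand)
    } yield (key, id)
    keyed.groupBy(_._1).values.flatMap { group =>
      val ids = group.map(_._2).distinct.sorted
      for (a <- ids; b <- ids if a < b) yield (a, b)
    }.toSet
  }
}
```

Each bucket can then be compared in parallel, which is where the reduction in pairwise work comes from.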
github.com/mrsqueeze/spark-hash
119. Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Note: Choose most frequent value (may not be 0)
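A minimal sketch of this (index,value) representation follows. The object name and the Int element type are assumptions for illustration; the row is assumed to be non-empty, and ties for the most frequent value are broken arbitrarily.

```scala
object SparseRows {
  // Keep only entries that differ from the row's most frequent value, as (index, value) pairs,
  // together with that most frequent value so the row can be rebuilt.
  def compress(row: Seq[Int]): (Int, Seq[(Int, Int)]) = {
    val mostFrequent = row.groupBy(identity).maxBy(_._2.size)._1
    val pairs = row.zipWithIndex.collect { case (v, i) if v != mostFrequent => (i, v) }
    (mostFrequent, pairs)
  }
}
```

Pairwise work then scales with the number of stored pairs (nnz) rather than with the full row width n.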
120. Recommendations
Summary Statistics and Top-K Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
121. Types of Recommendations
Non-personalized
No preference or behavior data for user, yet
aka “Cold Start Problem”
Personalized
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity
Items similar to your previously-liked items
122. Recommendation Terminology
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction, polynomial expansion
Hyper-parameter Tuning
K-Folds Cross Validation, Grid Search
Pipelines/Workflows
Chaining together Transformers and Estimators
123. Single Machine ML Algorithms
Stay Local, Distribute As Needed
Helps migration of existing single-node algos to Spark
Convert between Spark and Pandas DataFrames
New “pdspark” package: integration w/ scikit-learn, R
124. Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
125. Top Users by Like Count
“I might like users who have the most likes overall, based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
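The aggregation behind this recommendation is a group-and-count followed by a top-k. The sketch below uses plain Scala collections rather than Spark SQL so that it runs standalone; the Like case class and its field names are assumptions modeled on the fromuserid/touserid columns of the ratings table.

```scala
object TopUsers {
  // A like is (fromUserId, toUserId); the field names are illustrative.
  case class Like(fromUserId: Int, toUserId: Int)

  // Count likes received per user and return the k most-liked users, highest count first.
  // Ties are broken by the lower user id so the result is deterministic.
  def topK(likes: Seq[Like], k: Int): Seq[(Int, Int)] =
    likes.groupBy(_.toUserId)
      .map { case (user, ls) => (user, ls.size) }
      .toSeq
      .sortBy { case (user, count) => (-count, user) }
      .take(k)
}
```

Because it needs no per-user history, this kind of ranking also works for brand-new users.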
126. Top Influencers by Like Graph
“I might like the most-influential users in the overall like graph.”
GraphX: PageRank
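To show what PageRank computes on a like graph, here is a small power-iteration sketch on plain Scala collections. It is not GraphX code: the object name, the defaults (20 iterations, damping 0.85), and the simplifications (ranks start at 1.0 and are not normalized, and nodes without outgoing likes simply pass nothing on) are assumptions for illustration.

```scala
object SimplePageRank {
  // edges: (from, to) means "from likes to". Returns a rank per node after `iterations` rounds.
  def rank(edges: Seq[(Int, Int)],
           iterations: Int = 20,
           damping: Double = 0.85): Map[Int, Double] = {
    val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
    val outLinks = edges.groupBy(_._1).map { case (n, es) => (n, es.map(_._2)) }
    var ranks = nodes.map(n => (n, 1.0)).toMap
    for (_ <- 1 to iterations) {
      // Each node splits its current rank evenly among the nodes it likes.
      val contribs = for {
        (node, targets) <- outLinks.toSeq
        target <- targets
      } yield (target, ranks(node) / targets.size)
      val received = contribs.groupBy(_._1).map { case (n, cs) => (n, cs.map(_._2).sum) }
      ranks = nodes.map(n => (n, (1 - damping) + damping * received.getOrElse(n, 0.0))).toMap
    }
    ranks
  }
}
```

A user liked by many others, or by a few highly ranked others, ends up with the highest rank.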
127. Demo!
Generate Non-Personalized Recommendations
128. Personalized Recommendations
Understand Similarity and Personalized Recommendations
129. Like Behavior of Similar Users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
130. Demo!
Generate Personalized Recommendations using
Collaborative Filtering & Matrix Factorization
131. Similar Text-based Profiles as Me
“Our profiles have similar keywords and named entities.
We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
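As a rough illustration of the TF/IDF step, the sketch below weights each term in a profile by how often it occurs there and how rare it is across all profiles. It uses plain Scala collections rather than MLlib; the tokenizer is deliberately crude, the names are invented, and the smoothed formula idf = ln((N + 1) / (df + 1)) is the variant documented for MLlib's IDF.

```scala
object TfIdf {
  // Lowercase and split on anything that is not a letter; a deliberately crude tokenizer.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // TF-IDF weight per term per document, with idf = ln((N + 1) / (df + 1)).
  def weights(docs: Seq[String]): Seq[Map[String, Double]] = {
    val tokenized = docs.map(tokenize)
    val n = docs.size
    // Document frequency: in how many documents does each term appear?
    val docFreq = tokenized.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => (t, ts.size) }
    tokenized.map { tokens =>
      tokens.groupBy(identity).map { case (term, occurrences) =>
        (term, occurrences.size * math.log((n + 1.0) / (docFreq(term) + 1.0)))
      }
    }
  }
}
```

Words that appear in every profile get a weight of zero, so only distinguishing keywords contribute when two profiles are compared.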
132. Similar Profiles to Previous Likes
“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity
133. Relevant, High-Value Emails
“Your initial email references a lot of things in my profile.
I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition
[Figure labels: Her Email, My Profile]
134. Demo!
Feature Engineering for Text/NLP Use Cases
135. The Future of Recommendations
136. Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
137. NLP Conversation Starter Bot!
“If your responses to my generic opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
[Figure labels: Positive, Negative]
138. Maintaining the Spark
139. Recommendations for Couples
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
[Figure labels: similar plots, similar actors]
140. Final Recommendation!
141. Get Off the Computer & Meet People!
Thank you, Helsinki!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Signup for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline
Run all demos in your own environment with Docker!
142. More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
143. What’s Next?
144. What’s Next?
Autoscaling Spark Workers
Completely Docker-based
Docker Compose and Docker Machine
Lots of Demos and Examples!
Zeppelin & IPython/Jupyter notebooks
Advanced streaming use cases
Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling
Work closely with Brendan Gregg & Netflix
Surface & share more low-level details of Spark internals
145. Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)