VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
Google processes 400 petabytes of data every month and that was way back in 2007! With users generating massive amounts of data in social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard-disks going down, and such data being made available to everyone, and with the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
First Name: Gabriel
Last Name: Omar Cotelli
Email: g.cotelli@gmail.com
Title: Willow, the interaction tour
Abstract: This talk presents Willow, a Web Interaction Library that provides all the components and tools to create an interactive AJAX-based web application, without leaving the comfort of your Smalltalk environment.
After years of refinement and with a growing ecosystem of related projects, Willow is currently used by software development companies, personal commercial products and university research projects.
We will focus on presenting the user interaction affordances, while also showcasing its features and the most relevant reifications for web commands and components.
Bio: Bachelor in CS, Smalltalker since 2004, autodidact, free-thinker. Supporter of libre knowledge, human intelligence augmentation and open source software.
Works at Mercap Software developing financial software and contributes to several open source projects.
Each day our systems deal with more data, more traffic and more complex tasks. Every large-scale system has to integrate with other internal or external systems. At the same time it should provide up to date information. On the other hand business states high availability and reliability requirements. Those two make designing of such systems difficult. In particular it is difficult to keep your data consistent.
It often happens that not all requirements can't be technically met. Fortunately there are theoretical proofs that define what can and can be met in what circumstances. During the talk I will present some of the data consistency related problems typical to high traffic systems and try to sketch possible solutions. During the talk I will mention a few words about: CAP theorem, Eventual Consistency, At Least-Once Delivery and Self-Healing Systems.
This talk presents Willow, a Web Interaction Library that provides all the components and tools to create an interactive AJAX-based web application, without leaving the comfort of your Smalltalk environment.
After years of refinement and with a growing ecosystem of related projects, Willow is currently used by software development companies, personal commercial products and university research projects.
We will focus on presenting the user interaction affordances, while also showcasing its features and the most relevant reifications for web commands and components.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
Google processes 400 petabytes of data every month and that was way back in 2007! With users generating massive amounts of data in social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard-disks going down, and such data being made available to everyone, and with the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
First Name: Gabriel
Last Name: Omar Cotelli
Email: g.cotelli@gmail.com
Title: Willow, the interaction tour
Abstract: This talk presents Willow, a Web Interaction Library that provides all the components and tools to create an interactive AJAX-based web application, without leaving the comfort of your Smalltalk environment.
After years of refinement and with a growing ecosystem of related projects, Willow is currently used by software development companies, personal commercial products and university research projects.
We will focus on presenting the user interaction affordances, while also showcasing its features and the most relevant reifications for web commands and components.
Bio: Bachelor in CS, Smalltalker since 2004, autodidact, free-thinker. Supporter of libre knowledge, human intelligence augmentation and open source software.
Works at Mercap Software developing financial software and contributes to several open source projects.
Each day our systems deal with more data, more traffic and more complex tasks. Every large-scale system has to integrate with other internal or external systems. At the same time it should provide up to date information. On the other hand business states high availability and reliability requirements. Those two make designing of such systems difficult. In particular it is difficult to keep your data consistent.
It often happens that not all requirements can't be technically met. Fortunately there are theoretical proofs that define what can and can be met in what circumstances. During the talk I will present some of the data consistency related problems typical to high traffic systems and try to sketch possible solutions. During the talk I will mention a few words about: CAP theorem, Eventual Consistency, At Least-Once Delivery and Self-Healing Systems.
This talk presents Willow, a Web Interaction Library that provides all the components and tools to create an interactive AJAX-based web application, without leaving the comfort of your Smalltalk environment.
After years of refinement and with a growing ecosystem of related projects, Willow is currently used by software development companies, personal commercial products and university research projects.
We will focus on presenting the user interaction affordances, while also showcasing its features and the most relevant reifications for web commands and components.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
1. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
haventreddityet.com
A reddit recommendation discovery engine
James Douglas Pearce
1 / 13
2. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
HOW CAN I DISCOVER NEW SUBREDDITS?
Results seem to be skewed by a few very popular subreddits...
Simple list
No context
Does not encourage discovery!
2 / 13
3. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
Live Demo
3 / 13
4. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
ALGORITHM
Jaccard similarity: overlap of user in subreddits divided by total
Categories are defined as a set of example subreddits
4 / 13
5. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
GRAPH GERNERATION
Subgraphs generated on the fly with a series of random walks
Robert Frost rule: take the road less traveled
Hipster rule: Penalize popular, i.e. highly connected nodes
5 / 13
6. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
MORE ABOUT ME
Research:
Searching for Dark Matter at
the LHC
Big Data!
Specialize in data mining and
machine learning
Hobbies:
Kaggle competitions
Poker
Motorcycles
Boardgames
6 / 13
7. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
Questions?
7 / 13
8. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
UNIQUE CATEGORIES?
8 / 13
9. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
REDDIT BY CATEGORIES
9 / 13
10. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
JACCARD COEFFICIENT
The Jaccard coefficient can be used as a user-user type similarity measure.
JB(A) =
|A ∩ B|
|A ∪ B|
(1)
A is the set of redditors in subreddit A,
B is the set of redditors in subreddit B.
JCat
C (A) =
1
|C|
Bi∈C
JBi
(A) (2)
C is the set of sets of redditors in subreddits Bi in category C,
|C| is the size of set C (number of sets).
10 / 13
11. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
CRAWLING REDDIT WITH THE PRAW API
reddit.com is HUGE, I can only
take a small sample
The redditor-subreddit matrix I
am sampling from is sparse
I want to collect information
about as many subreddits as
possible
I want my redditor sets to
overlap
Strategy:
Collect “smart” data
Use “dumb” (Jaccard)
algorithm
11 / 13
12. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
MODEL VALIDATION
For each redditor hold out one subreddit they are part of
Make a list of the top N subreddits based on Jacquard similarity
Calculate the accuracy of that recommendation as a function of N12 / 13
13. THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
SUBGRAPH GENERATOR
Ptransition(ni) ∝ Jcanada(ni) · αntrans
· βncon
(3)
α ∈ (0, 1) is a transition decay factor
ntrans is the number of times ni has been traversed
β ∈ (0, 1) is a connectivity decay factor
ncon is ni number of connections (degree)
13 / 13