Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath


This document discusses using Spark and R for petabyte-scale data science at Comcast. It describes Comcast's top data initiatives, how Spark and SparkR enable machine learning algorithms, and an example of using a Hidden Markov Model with SparkR to analyze streaming data and detect hidden states. Performance testing showed the model processing 1.7 billion observations per day in 30 minutes with 92% accuracy.

Data & Analytics

Petabyte scale data science
using Spark & R
Sridhar Alla, Kiran Muglurmath
Comcast

Who we are
• Sridhar Alla
Director, Solution Architecture, Comcast
focuses on architecting and building solutions to meet the needs of the Enterprise Business
Intelligence initiatives.
• Kiran Muglurmath
Executive Director, Data Science, Comcast
focuses on architecting and building solutions to meet the needs of the Enterprise Business
Intelligence initiatives.

Top Initiatives
• Customer Churn Prediction
• Clickthru Analytics
• Personalization
• Customer Journey
• Modeling

SparkR
• Enables using R packages to process data
• Can run Machine Learning and Statistical Analysis algorithms

Spark MLlib
• Implements various Machine Learning Algorithms
• Classification, Regression, Collaborative Filtering,
Clustering, Decomposition
• Works with Streaming, Spark SQL, GraphX or with
SparkR.

Hidden Markov Model (HMM)

Dataset Preparation: Training Data

Dataset Preparation: Raw Data

Baum-Welch algorithm for state detection
1. Given the download/upload levels (observations) for a given time
interval, the model detects the hidden streaming state for that interval.
2. Given a sequence of observations (i = 1 .. n), the ith hidden variable depends
only on the (i-1)th hidden variable and, given it, is independent of all earlier
hidden variables (the Markov property). For a discrete random variable Xt with N
possible values, assume that P(Xt | Xt-1) is independent of time t.
3. From the observations, estimate the transition probabilities between the N
possible states. Then recursively compute likelihoods over all
observations, forwards and backwards, to identify the most probable state
for each observation.
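The forward and backward passes in step 3 can be sketched as follows. This is a minimal illustration for a discrete HMM whose parameters are already known, not the authors' implementation: the two states (0 = idle, 1 = streaming), the two observation levels (0 = low, 1 = high), and every probability below are assumed values chosen only for the example.

```python
def forward_backward(obs, start, trans, emit):
    """Return, for each time step, the posterior probability of every state."""
    n_states = len(start)
    n_steps = len(obs)

    # Forward pass: probability of each state given observations so far,
    # rescaled at every step to avoid numeric underflow.
    alpha = []
    current = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    total = sum(current)
    alpha.append([c / total for c in current])
    for t in range(1, n_steps):
        current = [
            emit[j][obs[t]]
            * sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        total = sum(current)
        alpha.append([c / total for c in current])

    # Backward pass: likelihood of the future observations from each state.
    beta = [[1.0] * n_states for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        current = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n_states))
            for i in range(n_states)
        ]
        total = sum(current)
        beta[t] = [c / total for c in current]

    # Combine both passes and normalise to get per-step state posteriors.
    posteriors = []
    for t in range(n_steps):
        joint = [alpha[t][s] * beta[t][s] for s in range(n_states)]
        total = sum(joint)
        posteriors.append([x / total for x in joint])
    return posteriors


# Assumed example parameters (hypothetical, not from the slides).
start = [0.5, 0.5]                      # initial state probabilities
trans = [[0.9, 0.1], [0.1, 0.9]]        # trans[i][j] = P(next=j | current=i)
emit = [[0.9, 0.1], [0.2, 0.8]]         # emit[s][o] = P(observation=o | state=s)
observations = [0, 0, 1, 1, 1]          # low, low, high, high, high

posteriors = forward_backward(observations, start, trans, emit)
states = [max(range(len(p)), key=lambda s: p[s]) for p in posteriors]
print(states)
```

Note that choosing the most probable state at each step independently, as done here, can differ from the single most likely state sequence, which is what Viterbi decoding returns.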

Sample Code (R):
# Load the HMM package
library(RHmm)
# Read the training and test datasets (CSV, no header row)
indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
# Fit a 3-state HMM to the download values in column V4
downloads <- as.numeric(indata$V4)
downloadModel <- HMMFit(downloads, nStates = 3)
# Viterbi decoding: most likely state sequence for the test data
testdownloads <- as.numeric(testdata$V4)
tVitPath <- viterbi(downloadModel, testdownloads)
# Forward-backward procedure, compute probabilities
tfb <- forwardBackward(downloadModel, testdownloads)
# Plot implied states
layout(1:3)
plot(testdownloads[1:100], ylab = "Down Bandwidth", type = "l", main = "Download bytes")
plot(tVitPath$states[1:100], ylab = "Download States", type = "l", main = "Download States")
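To make the decoding step in the sample code more concrete, here is a minimal sketch of Viterbi decoding for an HMM with Gaussian emissions and three states. It is not the RHmm implementation. In practice the means, standard deviations, and transition probabilities would come from the fitted model; here they are hypothetical values picked only for illustration.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, log_start, log_trans, means, sds):
    """Return the most likely hidden state sequence for the observations."""
    n = len(means)
    # Best log score of any path ending in each state after the first observation.
    score = [log_start[s] + gaussian_logpdf(obs[0], means[s], sds[s])
             for s in range(n)]
    backpointers = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor state for a path that ends in state j.
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + gaussian_logpdf(x, means[j], sds[j]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical 3-state model: 0 = idle, 1 = light use, 2 = streaming.
means = [1.0, 50.0, 500.0]
sds = [1.0, 10.0, 50.0]
log_start = [math.log(1 / 3)] * 3
log_trans = [[math.log(0.9 if i == j else 0.05) for j in range(3)]
             for i in range(3)]

test_downloads = [0.5, 1.2, 48.0, 55.0, 510.0, 495.0, 2.0]
path = viterbi(test_downloads, log_start, log_trans, means, sds)
print(path)
```

Working in log space keeps the scores numerically stable on long sequences, which matters when a day of observations is decoded in one pass.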

Output for a test dataset

Parallelizing in Hadoop
Steps:
• Create a sample dataset to build the model. This can be a small sample (~2,000 to 5,000 rows), or a size sufficient to build a
generalized model.
• Script the model as an R file, except that it should use streamed input instead of reading from CSV files. Separate map.R and
reduce.R files can be created if a reduction stage is required to create unified output datasets.
• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown
before:
cat dataset.csv | ./map.R | ./reduce.R > output.csv
• Ensure that the Hive tables are in delimited text format. Deploy and run the model using Hadoop streaming with the sample command
line below:
hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.min.split.size=268435456 \
  -D mapreduce.task.timeout=300000000 \
  -D mapreduce.map.memory.mb=3584 \
  -D mapreduce.reduce.memory.mb=8092 \
  -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE \
  -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ \
  -file ./map.R \
  -file <sample dataset to build model.csv> \
  -mapper ./map.R
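The streaming contract that a mapper has to follow (read delimited rows from standard input, write flagged rows to standard output) can be sketched as below. The deck's mapper is an R script; this Python version only illustrates the shape of such a script. The column position (fourth field, matching V4 in the sample code), the row layout, and the `threshold_classifier` function are assumptions, with the classifier standing in for scoring against the fitted HMM.

```python
import io


def run_mapper(lines, classify, column=3, sep=","):
    """Yield each well-formed input row with a state label appended."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(sep)
        try:
            value = float(fields[column])
        except (IndexError, ValueError):
            # Skip malformed rows instead of failing the whole task.
            continue
        yield sep.join(fields + [str(classify(value))])


def threshold_classifier(value):
    """Hypothetical stand-in for model scoring; the cutoff is an assumption."""
    return "streaming" if value >= 1_000_000 else "idle"


# In a real mapper the input would be sys.stdin; a small in-memory sample
# (made-up rows) is used here so the sketch runs on its own.
sample = io.StringIO(
    "m1,t1,10,500\n"
    "m1,t2,12,2500000\n"
    "bad,row\n"
    "\n"
    "m2,t1,9,abc\n"
)
flagged = list(run_mapper(sample, threshold_classifier))
for row in flagged:
    print(row)
```

Because each row is scored independently of rows in other input splits, a map-only job like the command above needs no reducer.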

Flagged output

Performance
• 1.7B observations/day
• About 30 minutes processing time/day
• 380 shared nodes
• 92% accuracy in detecting streaming events
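As a rough back-of-the-envelope check, the figures above imply the following average throughput. The per-node number assumes the work is spread evenly over all 380 nodes. Since the nodes are shared, that is only an approximation.

```python
observations_per_day = 1.7e9      # from the slide
processing_seconds = 30 * 60      # about 30 minutes
nodes = 380                       # shared nodes

throughput = observations_per_day / processing_seconds
per_node = throughput / nodes     # assumes an even spread across nodes

print(f"{throughput:,.0f} observations/second overall")   # 944,444
print(f"{per_node:,.0f} observations/second per node")    # 2,485
```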


We are hiring!
• Big Data Engineers (Hadoop, Spark, Kafka…)
• Data Analysts (R, SAS…)
• Big Data Analysts (Hive, Pig…)
sridhar_alla@cable.comcast.com

The document discusses mentorship modeling based on authorship graphs from Scopus data. It describes building features from co-authorship and correspondence graphs using Spark, validating predictions via crowdsourcing, and visualizing mentorship subgraphs. Key points include normalizing authorship data, aggregating node and edge features, applying pairwise mentorship models, obtaining training data via email campaigns, and using D3.js to interactively display mentorship subgraphs in Spark applications.

Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...

Databricks

In real-life applications, we often deal with situations where analysis needs to be conducted on graphs where the nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate their membership in communities (e.g. group, school, company, etc.), and the labels associated with edges may denote types of activities (e.g. comment, like, share, etc.). The current GraphX library in Spark does not directly support efficient calculation on the label-defined subgraph analysis and computations. In this session, the speakers will propose a general API library that is able to support analysis on multi-label graphs, and can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing the module/algorithm in GraphX, such as Pregel API. See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions. can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing the module/algorithm in GraphX, such as Pregel API. See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.

Inside Apache SystemML by Frederick Reiss

Spark Summit

1. The document describes the origins and goals of the SystemML project for scalable machine learning. 2. SystemML was created to allow data scientists to write machine learning algorithms in R and automatically compile and optimize them to run efficiently on large datasets in parallel. 3. An example alternating least squares algorithm is shown written concisely in R, while traditional approaches required translating algorithms to other languages like Scala which was error-prone and slowed iteration. SystemML aims to allow the same algorithm to run fast at large scale with the same answer.

Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...

Databricks

From training billions of ad impressions to scaling gradient boosted trees with more than three million nodes, Ad Targeting at Yelp uses Apache Spark in many stages of its large-scale machine learning pipeline. This session will explore examples of how Yelp employed and tweaked Spark to support big data feature engineering, visualizations and machine learning model training, evaluation and diagnostics. You’ll also hear about the challenges in building and deploying such a large-scale intelligent system in a production environment.

Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins

Spark Summit

MLeap and Combust.ML allow machine learning pipelines developed in Spark to be deployed directly to production by serializing them to a common format and executing them outside of Spark. This addresses the common problem of data scientists developing models in Spark that then need to be rewritten by engineers for production. It also allows pipelines to be deployed via REST APIs with low latency. Benchmark tests showed average response times of 14ms for a linear regression pipeline and 24ms for a random forest pipeline on a MacBook Pro. Future work includes supporting more Spark and scikit-learn transformers and unifying model libraries with Spark.

Spark Summit EU talk by Elena Lazovik

Spark Summit

Elena Lazovik presents research on enabling dynamic on-the-fly modifications of Spark applications without stopping the application. This is achieved through a library called dynamic-spark that allows parameters and functions to be updated remotely at runtime. Experiments show this approach can be used to dynamically change calculation parameters, switch between data sources, and perform Monte Carlo simulations with updated functions. Future work includes open-sourcing dynamic-spark and expanding its capabilities to support streaming data.

Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...

Spark Summit

This talk will cover the tools we used, the hurdles we faced and the work arounds we developed with the help from Databricks support in our attempt to build a custom machine learning model and use it to predict the TV ratings for different networks and demographics. The Apache Spark machine learning and dataframe APIs make it incredibly easy to produce a machine learning pipeline to solve an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high dimensional labels and relatively low dimensional features; at first pass such a problem is all but intractable but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application. Over the course of our work we have come across many tools that made our lives easier, and others that forced work around. In this talk we will review our custom multi-stage methodology, review the challenges we faced and walk through the key steps that made our project successful.

DASK and Apache Spark

Databricks

Gurpreet Singh from Microsoft gave a talk on scaling Python for data analysis and machine learning using DASK and Apache Spark. He discussed the challenges of scaling the Python data stack and compared options like DASK, Spark, and Spark MLlib. He provided examples of using DASK and PySpark DataFrames for parallel processing and showed how DASK-ML can be used to parallelize Scikit-Learn models. Distributed deep learning with tools like Project Hydrogen was also covered.

This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

Spark Summit

This document discusses building a graph of U.S. businesses using Spark technologies. It describes how Radius Intelligence builds a comprehensive business graph from multiple data sources by acquiring and preparing raw data, clustering records, and constructing the graph by linking business and location vertices and attributes through techniques like connected components analysis. Key lessons learned include that GraphX scales well, graph construction and updates are easy using RDD operations, and connected components analysis is an expensive graph operation.

Random Walks on Large Scale Graphs with Apache Spark with Min Shen

Databricks

Random Walks on graphs is a useful technique in machine learning, with applications in personalized PageRank, representational learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from the several unique abilities of Apache Spark. The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage. See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful construct opens new models of computation, not possible with state-of-the-art, for developing high-performant and scalable algorithms in data science and machine learning.

Spark DataFrames and ML Pipelines

Databricks

Data science on big data. Pragmatic approach

Pavel Mezentsev

This document discusses using Apache Spark for large scale machine learning problems with big data. It describes how Spark can be used to parallelize training and prediction tasks across large datasets that do not fit in memory. Spark allows using scikit-learn algorithms for machine learning tasks on big data by running the algorithms in a distributed manner across a Spark cluster. It also discusses alternative approaches to large scale machine learning, such as online/partial learning with stochastic gradient descent.

Download It

butest

The document discusses using Map-Reduce for machine learning algorithms on multi-core processors. It describes rewriting machine learning algorithms in "summation form" to express the independent computations as Map tasks and aggregating results as Reduce tasks. This formulation allows the algorithms to be parallelized efficiently across multiple cores. Specific machine learning algorithms that have been implemented or analyzed in this Map-Reduce framework are listed.

Enhancements on Spark SQL optimizer by Min Qiu

Spark Summit

This document summarizes enhancements made to Spark SQL's optimizer including rule-based optimizations and cost-based optimizations. Rule-based optimizations included join condition push down through predicate rewrite and join order adjustment, as well as data volume reduction through column pruning enhancements. Cost-based optimizations leveraged statistics and histograms to select optimal join types, partitions, and join orders. Future work focuses on enumerating the full space of query plans and improving estimation accuracy.

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...

Spark Summit

This document discusses using Spark Streaming and GraphX to perform near-realtime analytics on large distributed systems. The authors present a model-driven approach to implement Pregel-style graph processing to handle heterogeneous graphs. They were able to achieve over 100,000 messages per second on a 4 node cluster by using sufficient batch sizes. Implementation challenges included scaling graph processing across nodes, dealing with graph heterogeneity, and hidden memory costs from intermediate RDDs. Lessons learned include the importance of partitioning, testing high availability, and addressing memory sinks.

Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow

Databricks

This document discusses platforms for democratizing data science and enabling enterprise grade machine learning applications. It introduces Flock, a platform that aims to automate the machine learning lifecycle including tracking experiments, managing models, and deploying models for production. It demonstrates Flock by instrumenting Python code for a light gradient boosted machine model to track parameters, log models to MLFlow, convert the model to ONNX, optimize it, and deploy it as a REST API. Future work discussed includes improving Flock's data governance, generalizing auto-tracking capabilities, and integrating with other systems like SQL and Spark for end-to-end pipeline provenance.

Huawei Advanced Data Science With Spark Streaming

Jen Aman

This document discusses streamDM, an open source machine learning library for stream mining in Spark Streaming. It summarizes streamDM's capabilities for incremental learning on data streams using algorithms like SGD, Naive Bayes, clustering and decision trees. Examples of using streamDM in Huawei's network alarm analysis and fault localization systems are provided, demonstrating improvements in efficiency, accuracy and ability to handle large volumes of streaming data. The document encourages researchers to apply for Huawei's Innovation Research Program grants to further collaborative work on stream mining algorithms and applications.

MLeap: Release Spark ML Pipelines

DataWorks Summit/Hadoop Summit

MLeap is a tool that allows machine learning models trained using Spark ML to be deployed to production environments without Spark. It addresses common issues like data scientists and engineers having to re-write data pipelines and model code for production. MLeap uses Spark for training but removes the Spark dependency for deployment. It provides core machine learning components, a runtime for transformations, and serialization to bundle models. This allows models to be deployed to APIs and services more quickly than traditional Spark-based approaches. Benchmarks show MLeap models can transform data over 20x faster than equivalent Spark models.

Spark Summit EU talk by Nick Pentreath

Spark Summit

This document summarizes a presentation about scaling factorization machines on Apache Spark using parameter servers. The presentation introduces factorization machines as a method for modeling feature interactions in large datasets. It then discusses how distributed factorization machine models can be implemented on Spark using a parameter server approach, as demonstrated by the GlintFM library. Performance results show that GlintFM achieves significant speedups over the existing spark-libFM implementation on a large Criteo dataset, scaling to billions of parameters. Some remaining challenges are also outlined, along with ideas for future work.

Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...

Spark Summit

This document discusses using graph-based machine learning on browsing history data to discover customer purchase intent for advertisers. It presents challenges with existing solutions like SVD that identify general online buyers but not advertiser-specific patterns. The document proposes representing sites as a graph and using GraphX's Pregel API to propagate positive customer labels along site connections, assigning higher scores to similar sites. Evaluation shows this approach identifies advertiser-relevant sites while addressing issues like model sparsity and frequency. It also provides lessons learned on optimizing Spark jobs.

Large Scale Machine learning with Spark

Md. Mahedi Kaysar

 Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Java, Scala, Python and R.  Typical machine learning workflows in Spark involve loading data, preprocessing, feature engineering, training models, evaluating performance, and tuning hyperparameters. Spark MLlib provides algorithms for common tasks like classification, regression, clustering and collaborative filtering.  The document provides an example of building a spam filtering application in Spark. It involves reading email data, extracting features using tokenization and hashing, training a logistic regression model, evaluating performance on test data, and tuning hyperparameters via cross validation.

Extending Machine Learning Algorithms with PySpark

Databricks

1. The document discusses using PySpark and Pandas UDFs to perform machine learning at scale for genomic data. It describes a genomics use case called GloWGR that uses this approach. 2. Three key problems are identified with existing tools: genomic data is growing too quickly; bioinformaticians are unfamiliar with Scala; and ML algorithms are difficult to write in Spark SQL. The solutions proposed are to use Spark, provide a Python client, and write algorithms in Python linked to Spark. 3. GloWGR is presented as a novel whole genome regression and association study algorithm built with PySpark. It uses Pandas UDFs to parallelize the REGENIE method and perform tasks like dimensionality

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...

Spark Summit

Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to address the above challenge is having the ability to predict performance of applications under various resource configurations so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction framework for large scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.

Spark Summit EU talk by Bas Geerdink

Spark Summit

1. The document discusses using Lambda architecture principles to perform fast data analytics on streaming and batch data from IoT sources using Spark Streaming and MLlib. 2. A proposed smart parking use case would recommend the best parking garage by combining streaming GPS data from cars with batch updates on garage capacity, scoring each garage using machine learning models. 3. The Lambda architecture is implemented using Kafka to ingest streaming GPS updates and batch capacity updates, Spark Streaming and Spark SQL to prepare, transform, and join the data, and MLlib to score and rank the garages in real-time.

26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat

Spark Summit

This document summarizes an approach to build an app recommendation engine using collaborative filtering on data from 53,793 apps and 496 million users. It implements collaborative filtering using 100 lines of Spark code to generate low-dimensional user and app features and compute 26 trillion user-app ratings. Key aspects included using data frames, caching joined data, broadcasting app features for efficient BLAS-3 matrix multiplication, and evaluating recommendations.

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

Databricks

"In addition to the many data engineering initiatives at Starbucks, we are also working on many interesting data science initatives. The business scenarios involved in our deep learning initatives include (but are not limited to) planogram analysis (layout of our stores for efficient partner and customer flow) to predicting product pairings (e.g. purchase a caramel machiato and perhaps you would like caramel brownie) via the product components using graph convolutional networks. For this session, we will be focusing on how we can run distributed Keras (TensorFlow backend) training to perform image analytics. This will be combined with MLflow to showcase the data science lifecycle and how Databricks + MLflow simplifies it. "

Spark Summit EU talk by Kent Buenaventura and Willaim Lau

Spark Summit

This document summarizes Unity Technologies' journey migrating their data pipeline from a legacy Hive-based system to using Spark. Some key points: - They moved to Spark for its scaling, performance, and ability to handle both batch and streaming workloads from a single stack. - The new Spark-based pipeline uses Airflow for workflow management and saves processed data to Parquet files stored in S3 for backup. - Taking a test-driven development approach with unit and integration tests helped ensure a smooth migration. Staging the pipeline in an environment similar to production also helped address issues early. - The new Spark pipeline completed analysis stages up to 2x faster than the previous Hive-based system and

OSINT for Attack and Defense

Andrew McNicol

This document provides an overview of how open-source intelligence (OSINT) techniques can be used both offensively and defensively. It discusses tools like Shodan, Maltego, Google searches, and malware sandboxes that can be leveraged to gather technical information about targets, infrastructure, and indicators of compromise. The document also emphasizes the importance of automation and privacy when conducting OSINT research to enhance attacks or strengthen defenses.

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...

Spark Summit

This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled through checkpointing saved DAG information and write ahead logs that replicate received data blocks. Restarting the driver from checkpoints allows recovering the application state.

What's hot

Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...

Spark Summit

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

Spark Summit

Random Walks on Large Scale Graphs with Apache Spark with Min Shen

Databricks

Spark DataFrames and ML Pipelines

Databricks

Data science on big data. Pragmatic approach

Pavel Mezentsev

Download It

butest

Enhancements on Spark SQL optimizer by Min Qiu

Spark Summit

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...

Spark Summit

Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow

Databricks

Huawei Advanced Data Science With Spark Streaming

Jen Aman

MLeap: Release Spark ML Pipelines

DataWorks Summit/Hadoop Summit

Spark Summit EU talk by Nick Pentreath

Spark Summit

Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...

Spark Summit

Large Scale Machine learning with Spark

Md. Mahedi Kaysar

Extending Machine Learning Algorithms with PySpark

Databricks

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...

Spark Summit

Spark Summit EU talk by Bas Geerdink

Spark Summit

26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat

Spark Summit

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

Databricks

Spark Summit EU talk by Kent Buenaventura and Willaim Lau

Spark Summit

What's hot (20)

Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

Random Walks on Large Scale Graphs with Apache Spark with Min Shen

Spark DataFrames and ML Pipelines

Data science on big data. Pragmatic approach

Download It

Enhancements on Spark SQL optimizer by Min Qiu

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...

Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow

Huawei Advanced Data Science With Spark Streaming

MLeap: Release Spark ML Pipelines

Spark Summit EU talk by Nick Pentreath

Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...

Large Scale Machine learning with Spark

Extending Machine Learning Algorithms with PySpark

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...

Spark Summit EU talk by Bas Geerdink

26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

Spark Summit EU talk by Kent Buenaventura and Willaim Lau

Viewers also liked

OSINT for Attack and Defense

Andrew McNicol

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...

Spark Summit

OSINT 2.0 - Past, present and future

Christian Martorella

How OSINT will play an important role in the future, helping to predict, prevent and react against incidents that threaten the Global security. The presentation will delve into the tools and techniques that enable OSINT practitioners to measure the Global security signals conveyed by the Internet. Multiple facets of information dissemination, collection, analysis and interpretation will be examined, with a focus on the security dimension of the information.

OSINT - Open Source Intelligence

c0c0n - International Cyber Security and Policing Conference

Open source intelligence, or OSINT, involves finding and analyzing publicly available information to produce actionable intelligence. Some common OSINT tools include Maltego for mapping relationships, AnonPaste Monitor to track leaked data, and social media monitoring on platforms like Twitter and Facebook. A case study example discusses using OSINT to analyze the "Lords of Dharmaraja" criminal network through tools like Nostradamus, which integrates diverse data sources and enables relationship analysis and pattern detection.

24 June 2015: Working with CDE

Defence and Security Accelerator

The Centre for Defence Enterprise (CDE) aims to fund innovative, high-risk research projects to develop cost-effective military capabilities. Over seven years, CDE received over 5,600 proposals, funded 931 of them totaling £57 million in investment. CDE seeks proposals from small- and medium-sized enterprises, academia, and wider industry through two routes: enduring competitions with £3 million annual funding and themed competitions focused on specific requirements. CDE operates under principles of applying innovation to develop future military capabilities through partnerships between government, industry, and investors.

Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...

Spark Summit

This document discusses using Stanford NLP and Spark to extract relationships from unstructured text. It presents a pipeline for annotating entities in oil and gas supply chain text using NER, extracting relationships using pattern matching, and simplifying sentences. The pipeline is implemented using Spark for scalability and fault tolerance. Benefits of the approach include code reuse between batch and streaming layers and easy distribution of NLP processing.

Open Source Intelligence (OSINT)

festival ICT 2016

Durante l’intervento verranno presentati i cardini del processo di ricerca delle informazioni mediante la consultazione di fonti di pubblico accesso. Sarà illustrata la teoria alla base di questo processo che prevede l’identificazione delle fonti, la selezione e la valutazione del loro contenuto informativo per arrivare infine all’utilizzo stesso dell’informazione estratta. Nella seconda fase della presentazione verranno mostrati i tool e le metodologie per l’estrazione di informazioni mediante l’analisi di documenti, foto, social network e altre fonti spesso trascurate. In ultimo saranno mostrati sistemi in grado di correlare diverse informazioni provenienti dalle fonti aperte e verranno discussi i relativi scenari di utilizzo nonché le possibili contromisure.

2017 Digital Yearbook

We Are Social Singapore

Digital in 2017 Global Overview

We Are Social Singapore

- More than half of the world's population now uses the internet, with global internet users growing 8% year-over-year. Mobile internet and social media usage are also growing significantly.
- Social media users grew over 20% in the past year to over 2.5 billion active users monthly. Mobile social media use in particular saw 30% growth.
- The report provides statistics on internet, social media, and mobile usage globally and by region, finding continued growth in connectivity and usage around the world.

Global Digital Statshot Q3 2017

We Are Social Singapore

Similar to Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....

Databricks

Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.

Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...

Data Con LA

"R is the most popular language in the data-science community with 2+ million users and 6000+ R packages. R’s adoption evolved along with its easy-to-use statistical language, graphics, packages, tools and active community. In this session we will introduce Distributed R, a new open-source technology that solves the scalability and performance limitations of vanilla R. Since R is single-threaded and does not scale to accommodate large datasets, Distributed R addresses many of R’s limitations. Distributed R efficiently shares sparse structured data, leverages multi-cores, and dynamically partitions data to mitigate load imbalance. In this talk, we will show the promise of this approach by demonstrating how important machine learning and graph algorithms can be expressed in a single framework and are substantially faster under Distributed R. Additionally, we will show how Distributed R complements Vertica, a state-of-the-art columnar analytics database, to deliver a full-cycle, fully integrated, data “prep-analyze-deploy” solution."

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...

Spark Summit

In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick up today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all the product catalog changes in the index had a 24-hour delay; using Spark Streaming, we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end customers instant access to features like availability of a product, store pick up, etc. Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is being used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also in many other domains such as performance monitoring, fraud detection, etc. During this work, we realized that Spark DataFrames are not only able to process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc., all in the same place very easily, and can build DataFrame templates that can be used and reused by multiple teams effectively. We believe that, if implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential of becoming a unified data language. We conclude that Spark Streaming and DataFrames are the key to processing extremely large streams of data in real-time with ease of use.

DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...

DataStax

Leveraging your operational data for advanced and predictive analytics enables deeper insights and greater value for cloud applications. DSE Analytics is a complete platform for Operational Analytics, including data ingestion, stream processing, batch analysis, and machine learning. In this talk we will provide an overview of DSE Analytics as it applies to data science tools and techniques, and demonstrate these via real world use cases and examples.

Speakers: Brian Hess, Rob Murphy, Rocco Varela

About the speakers: Brian Hess, Senior Product Manager, Analytics, DataStax. Brian has been in the analytics space for over 15 years ranging from government to data mining applied research to analytics in enterprise data warehousing and NoSQL engines, in roles ranging from Cryptologic Mathematician to Director of Advanced Analytics to Senior Product Manager. In all these roles he has pushed data analytics and processing to massive scales in order to solve problems that were previously unsolvable.

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...

Spark Summit

The document discusses Sparkle, a solution built by Comcast to address challenges in processing massive amounts of data and enabling data science workflows at scale. Sparkle is a centralized processing system with SQL and machine learning capabilities that is highly scalable and accessible via a REST API. It is used by Comcast to power various use cases including churn modeling, price elasticity analysis, and direct mail campaign optimization.

Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Badri Narayan Bhaskar

This document summarizes scaling machine learning to billions of parameters using Spark and a parameter server architecture. It describes the requirements for supporting both batch and sequential optimization at web scale. It then outlines the Spark + Parameter server approach, leveraging Spark for distributed processing and the parameter server for synchronizing model updates. Examples of distributed L-BFGS and Word2Vec training are provided to illustrate batch and sequential optimization respectively using this architecture.

Scaling Machine Learning To Billions Of Parameters

Jen Aman

This document summarizes scaling machine learning to billions of parameters using Spark and a parameter server architecture. It describes the requirements for supporting both batch and sequential optimization at web scale. It then outlines the Spark + Parameter server approach, leveraging Spark for distributed processing and the parameter server for synchronizing model updates. Examples of distributed L-BFGS and Word2Vec training are presented to illustrate batch and sequential optimization respectively.

Applied Machine learning using H2O, python and R Workshop

Avkash Chauhan

Note: Get all workshop content at https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup

Basic knowledge of R/Python and general ML concepts

Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate in the workshop.

Level: 200
Time: 2 Hours

Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
  - Understanding linear regression model with an example
  - Understanding binomial classification with an example
  - Understanding multinomial classification with an example
- Unsupervised learning
  - Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo

Making sense of the Graph Revolution

InfiniteGraph

In 2013:
- 1.4 trillion digital interactions happen per month.
- 2.9 million emails are sent every second.
- 72.9 products are ordered on Amazon per second.

That is a lot of connected data; graphs are truly everywhere. Companies are finding that graph database technology is helping them make sense of their big data. Objectivity’s Nick Quinn, Chief Architect of InfiniteGraph, shows us just how popular graph databases have become and where they are being used, as well as showing us the ins and outs. Do you want to build technology that does great things with big data? You might want to find out what your colleagues are Tweeting about, make recommendations for apps, music or other retail that result in higher purchase rates, discover hidden connections between new and recorded medical research data, or maybe even leverage intel across government agencies to catch the bad guys. All this is possible with a graph database.

Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter

Databricks

Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
- the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code.
- how to work with complex data such as images in Spark and Deep Learning Pipelines.
- how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.

Finally, we discuss integration with popular deep learning frameworks.

Best Practices for Building and Deploying Data Pipelines in Apache Spark

Databricks

Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations. We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.

Apache Eagle - Monitor Hadoop in Real Time

DataWorks Summit/Hadoop Summit

Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.

Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...

PAPIs.io

Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala

Chetan Khatri

TransmogrifAI is an open source library for automating machine learning workflows built on Scala and Spark. It helps automate tasks like feature engineering, selection, model selection, and hyperparameter tuning. This reduces machine learning development time from months to hours. TransmogrifAI enforces type safety and modularity to build reusable, production-ready models. It was created by Salesforce to make machine learning more accessible to developers without a PhD in machine learning.

Real time streaming analytics

Anirudh

AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)

Amazon Web Services Korea

This document discusses how Amazon SageMaker can be used to train machine learning models on large datasets using hosted Jupyter notebooks. It notes that DigitalGlobe plans to use SageMaker to train models on petabytes of Earth observation imagery so that users can create and deploy models within one scalable environment. The document also quotes the CTO of Maxar Technologies saying they will use SageMaker to build and deploy novel AI algorithms at scale to solve complex problems.

ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics

Miklos Christine

Powering a Graph Data System with Scylla + JanusGraph

ScyllaDB

Recent Developments In SparkR For Advanced Analytics

Databricks

Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.

Alex mang patterns for scalability in microsoft azure application

Codecamp Romania

The document discusses patterns for scalability in Microsoft Azure applications. It covers queue-based load leveling, competing consumers, and priority queue patterns for handling application load and message processing. It also discusses materialized view and sharding patterns for scaling databases, where materialized views optimize queries and sharding partitions data horizontally across multiple servers. The talk includes demos of priority queue and sharding patterns to illustrate their implementations.

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Spark Summit

In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA’s highly parallel computing capability to accelerate Spark SQL queries and, because FPGAs are more power-efficient than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and each is fed into an Engine Unit according to its pattern. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run with the FPGA-based Spark SQL acceleration architecture on XEON E5 and FPGA to the same queries run with Spark SQL on XEON E5, showing a 10X to 100X improvement, and we will demonstrate one SQL query workload from a real customer.

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Spark Summit

In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Spark Summit

This presentation introduces how we designed and implemented a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line has a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data, with low frequency but large volume; 2) continuous data with high frequency. Each unit of the latter is small, but the total amount is very large, such as the vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, bursty, unordered, and unbounded. Making effective real-time decisions to extract value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we are able to build a low-latency, high-throughput, and reliable operational system involving data acquisition, transmission, analysis, and storage. An actual use case showed that the system meets the needs of real-time decision-making. The system greatly improves the efficiency of predictive fault repair and production line material tracking, and can reduce the labor force needed for the production lines by about half.

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

Spark Summit

As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitan areas in the US, with 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

Spark Summit

Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame Motif API as we show examples of both side by side. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Spark Summit

Building accurate machine learning models has been an art practiced by data scientists, involving algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which has 272 CPU cores, 2TB memory, and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5) but from a more technical perspective.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Spark Summit

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.

Next CERN Accelerator Logging Service with Jakub Wozniak

Spark Summit

The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service. The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex. The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments. During this talk, Jakub will speak about NXCALS requirements and design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.

Powering a Startup with Apache Spark with Kevin Kim

Spark Summit

At Between (a mobile app for couples, with 20M downloads globally), Spark is used for everything from daily batch jobs for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and expandability of Spark, data operations have become extremely efficient. The entire team, including Biz Dev, Global Operations, and Designers, enjoys the data results, so Spark is empowering the whole company for data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. After this presentation, listeners will know how a small, agile team lives with data (how we build the organization, culture, and technical base).

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

Spark Summit

In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in them. In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using existing functional blocks, as well as the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API. To give a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples, one based on an Industry 4.0 success case and the other on a logistics success case.

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spark Summit

Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Spark Summit

Goal Based Data Production with Sim Simeonov

Spark Summit

Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enable a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Spark Summit

Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and sends notifications describing the problem detected, so that users can act quickly to avoid serious problems that directly impact the company’s revenue, and reduce the time to action. We will present an architecture not only for a monitoring system but also for an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code and will be able to adapt and implement it in your company. This solution already helped prevent about US$3 million in losses last year.

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Spark Summit

Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML-models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these feature. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex, ML-models with high performance.

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

Spark Summit

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. 
Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Spark Summit

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside the Spark SQL, and optimize the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets are: social network data ( e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance.

Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath

1. Petabyte scale data science using Spark & R Sridhar Alla, Kiran Muglurmath Comcast

2. Who we are • Sridhar Alla Director, Solution Architecture, Comcast focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives. • Kiran Muglurmath Executive Director, Data Science, Comcast focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

3. Top Initiatives • Customer Churn Prediction • Clickthru Analytics • Personalization • Customer Journey • Modeling

4. Spark Stack

5. SparkR • Enables using R packages to process data • Can run Machine Learning and Statistical Analysis algorithms

6. Spark MLlib • Implements various Machine Learning Algorithms • Classification, Regression, Collaborative Filtering, Clustering, Decomposition • Works with Streaming, Spark SQL, GraphX or with SparkR.

7. Using PySpark & SparkR

8. Hidden Markov Model (HMM) • Supporting points go here.

9. Dataset Preparation: Training Data • Supporting points go here.

10. Dataset Preparation: Raw Data • Supporting points go here.

11. Baum – Welch algorithm for state detection 1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval. 2. Given a set of observations (i = 1 .. n), the ith hidden variable depends only on the (i – 1)th hidden variable; given that variable, it is independent of all earlier hidden variables (the Markov property). For a discrete random variable X_t with N possible values, assume that P(X_t | X_{t-1}) is independent of time t. 3. From the observations, estimate the transition probabilities for the N possible states. Then recursively compute likelihoods for all observations, forwards and backwards, to identify the most probable state for each observation.
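The forward and backward recursion described on this slide can be sketched in base R. This is a minimal illustration and not the authors' implementation: the two states, the three observation levels, and every probability below are made-up assumptions, and the function name forward_backward is hypothetical. It performs the smoothing pass only and takes the model parameters as given, whereas Baum – Welch would additionally re-estimate them from the data.

```r
# Forward-backward smoothing for a small discrete HMM (base R only).
# All parameters below are illustrative assumptions, not values from the talk.

forward_backward <- function(obs, init, trans, emis) {
  n <- length(obs)
  k <- length(init)
  alpha <- matrix(0, n, k)   # scaled forward probabilities
  beta  <- matrix(1, n, k)   # scaled backward probabilities
  scale <- numeric(n)        # per-step normalizers

  # Forward pass: P(state at t | observations 1..t), rescaled at each step
  alpha[1, ] <- init * emis[, obs[1]]
  scale[1] <- sum(alpha[1, ])
  alpha[1, ] <- alpha[1, ] / scale[1]
  if (n > 1) for (t in 2:n) {
    alpha[t, ] <- as.vector(alpha[t - 1, ] %*% trans) * emis[, obs[t]]
    scale[t] <- sum(alpha[t, ])
    alpha[t, ] <- alpha[t, ] / scale[t]
  }

  # Backward pass: contribution of observations t+1..n, same scaling
  if (n > 1) for (t in (n - 1):1) {
    beta[t, ] <- as.vector(trans %*% (emis[, obs[t + 1]] * beta[t + 1, ])) /
      scale[t + 1]
  }

  # Posterior state probabilities and the most probable state per observation
  post <- alpha * beta
  post <- post / rowSums(post)
  list(posterior = post,
       states = max.col(post, ties.method = "first"),
       logLik = sum(log(scale)))
}

# Hypothetical model: state 1 = idle, state 2 = streaming;
# observations are download levels coded 1 = low, 2 = medium, 3 = high.
init  <- c(0.5, 0.5)
trans <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
emis  <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
obs   <- c(1, 1, 3, 3, 3, 1)

res <- forward_backward(obs, init, trans, emis)
print(round(res$posterior, 3))
print(res$states)
```

Rescaling at every step keeps the recursion numerically stable on long sequences, and the sum of the log normalizers gives the log-likelihood of the observation sequence.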

12. Sample Code (R):
library('RHmm')
# Read the training and test datasets (comma-separated, no header row)
indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
# Fit a 3-state HMM on the download column (V4) of the training data
downloads <- c(as.numeric(indata$V4))
downloadModel <- HMMFit(downloads, nStates=3)
# Decode the most likely state sequence for the test data
testdownloads <- c(as.numeric(testdata$V4))
tVitPath <- viterbi(downloadModel, testdownloads)
# Forward-backward procedure, compute probabilities
tfb <- forwardBackward(downloadModel, testdownloads)
# Plot implied states
layout(1:3)
plot(testdownloads[1:100],ylab="Down Bandwidth",type="l", main="Download bytes")
plot(tVitPath$states[1:100],ylab="Download States",type="l", main="Download States")

13. Output for a test dataset • Supporting points go here.

14. Parallelizing in Hadoop Steps:
• Create a sample dataset to build the model. This can be a small sample (~2000 – 5000 rows), or a size sufficient to build a generalized model.
• Script the model as an R file, except that it should use streamed input instead of reading from CSV files. Separate map.R and reduce.R files can be created if a reduction stage is required to create unified output datasets.
• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before:
cat dataset.csv | map.R | reduce.R > output.csv
• Ensure that the Hive tables are in delimited text format. Deploy and run the model using Hadoop streaming with the sample command line below:
hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar -D mapred.min.split.size=268435456 -D mapreduce.task.timeout=300000000 -D mapreduce.map.memory.mb=3584 -D mapreduce.reduce.memory.mb=8092 -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ -file ./map.R -file <sample dataset to build model.csv> -mapper ./map.R
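The "streamed input instead of CSV files" step might look roughly like the sketch below. It is an assumption-laden skeleton, not the map.R used in the talk: run_mapper and label_rows are hypothetical names, the comma delimiter and the use of column 4 for download bytes are guesses based on the sample code, and label_rows is a fixed-threshold stand-in for where a fitted HMM would be applied.

```r
#!/usr/bin/env Rscript
# Sketch of a streaming mapper: read delimited rows from an input connection,
# label each row, and write the row back out with the label appended.

# Hypothetical stand-in for the real model: buckets download bytes into
# three levels using fixed thresholds. A fitted HMM would replace this.
label_rows <- function(download_bytes, cuts = c(1e5, 1e7)) {
  findInterval(download_bytes, cuts) + 1L
}

run_mapper <- function(input, output, sep = ",", value_col = 4, chunk = 10000) {
  repeat {
    # Read in chunks so memory use stays bounded on large input splits
    lines <- readLines(input, n = chunk, warn = FALSE)
    if (length(lines) == 0) break
    fields <- strsplit(lines, sep, fixed = TRUE)
    values <- vapply(
      fields,
      function(f) suppressWarnings(as.numeric(f[value_col])),
      numeric(1)
    )
    writeLines(paste(lines, label_rows(values), sep = sep), output)
  }
  invisible(NULL)
}

# Small in-memory demonstration with made-up rows.
demo_in <- textConnection(c("a,b,c,50000", "a,b,c,20000000"))
run_mapper(demo_in, stdout())
close(demo_in)

# In an actual map.R the demonstration would be replaced by something like:
#   con <- file("stdin", open = "r")
#   run_mapper(con, stdout())
#   close(con)
```

Keeping the row logic in a function that takes connections makes it possible to exercise the mapper locally with the `cat dataset.csv | map.R` pattern from the slide before deploying it. The delimiter is a parameter because Hive text tables do not necessarily use commas.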

15. Flagged output • Supporting points go here.

16. Performance • 1.7B observations/day • About 30 minutes processing time/day • 380 shared nodes • 92% accuracy in detecting streaming events
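As a rough sanity check on these figures, the implied throughput can be worked out directly. The snippet below uses only the numbers on this slide and assumes the 30 minutes is wall-clock time for one full day of observations. Because the nodes are shared, the per-node figure is only an average, not a measured rate.

```r
# Back-of-the-envelope throughput from the figures on the Performance slide.
observations_per_day <- 1.7e9
processing_seconds   <- 30 * 60   # "about 30 minutes"
nodes                <- 380       # shared nodes, so per-node values are averages

obs_per_second          <- observations_per_day / processing_seconds
obs_per_second_per_node <- obs_per_second / nodes

cat(sprintf("%.0f observations/s overall, %.0f per node\n",
            obs_per_second, obs_per_second_per_node))
```

That works out to a little under one million observations per second across the cluster, or roughly 2,500 per node per second.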


19. We are hiring! • Big Data Engineers (Hadoop, Spark, Kafka…) • Data Analysts (R, SAS…) • Big Data Analysts (Hive, Pig…) sridhar_alla@cable.comcast.com

20. THANK YOU.

