This talk presents how three Scala libraries - Smile, Saddle, and Spark ML - satisfy the requirements of new Big Data Science projects, using click-through rate prediction as a worked example.
1. The document discusses big data and data science libraries in Scala for tasks like preprocessing, machine learning, and evaluation.
2. It demonstrates using Spark and Smile libraries on a real dataset to optimize click-through rates by analyzing features like OS, categories, and time.
3. The document compares the performance of Spark and Smile for random forest classification and regression on a 13GB dataset.
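The kind of feature-level CTR analysis described above can be sketched in a few lines of plain Python (toy, hypothetical click log; the talk itself works in Scala with Smile and Spark ML on the real 13GB dataset):

```python
from collections import defaultdict

# Toy click log: (os, hour, clicked) rows; values are hypothetical.
events = [
    ("android", 9, 1), ("android", 9, 0), ("android", 21, 1),
    ("ios", 9, 0), ("ios", 21, 1), ("ios", 21, 1), ("ios", 21, 0),
]

def ctr_by(events, key):
    """Click-through rate grouped by a feature extractor."""
    clicks, views = defaultdict(int), defaultdict(int)
    for os, hour, clicked in events:
        k = key(os, hour)
        views[k] += 1
        clicks[k] += clicked
    return {k: clicks[k] / views[k] for k in views}

print(ctr_by(events, lambda os, hour: os))    # CTR per OS
print(ctr_by(events, lambda os, hour: hour))  # CTR per hour: time-of-day effect
```

Grouping by OS, category, or hour in this way is the exploratory step that precedes feeding those features to a random forest.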
Pinterest - Big Data Machine Learning Platform at Pinterest (Alluxio, Inc.)
This was presented by Yongsheng Wu, head of the big data and ML platform at Pinterest, at the Alluxio Bay Area meetup.
Yongsheng shares Pinterest's journey to build a fast and scalable big data and ML platform in AWS that can handle the volume and complexity of Pinterest's data at scale. The talk covers the requirements of the platform, the challenges encountered, the technologies chosen, and the tradeoffs that were made.
GraphLab Conference 2014 Keynote - Carlos Guestrin (Turi, Inc.)
This document introduces GraphLab Create, a machine learning toolkit that aims to help data scientists unleash the power of data science from inspiration to production. It highlights key features of GraphLab Create including scalable data structures that allow analyzing big data on a single machine without running out of memory, robust machine learning algorithms, and tools for deploying predictive applications and services to production environments from the same code used for prototyping. The document provides examples of using GraphLab Create for tasks like recommender systems, fraud detection, and deep learning. It emphasizes that GraphLab Create allows users to be productive on a single machine, at scale, and in production.
Machine Learning at Scale with MLflow and Apache Spark (Databricks)
This document summarizes the challenges faced by SocGen, a large French bank, in implementing machine learning at scale using Spark and MLflow. Some key challenges included: 1) Keeping data and models local for regulatory reasons while performing training and prediction, 2) Ensuring reliability when moving models between prototyping and production phases, 3) Managing different Python package dependencies, 4) Tracking and managing many models, and 5) Ensuring high availability of the tracking server. The presentation provided a concrete example of using Spark, MLflow, and Kafka to periodically retrain a model for scoring news articles and handling user feedback in a scalable and reliable way.
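The periodic retraining pattern described - consume a batch of user feedback, refit, redeploy - can be reduced to a stdlib-only toy. The names NewsScorer and feedback_topic are hypothetical stand-ins for the talk's Spark/MLflow/Kafka stack:

```python
from collections import deque

class NewsScorer:
    """Toy relevance scorer retrained from user feedback (hypothetical)."""
    def __init__(self):
        self.bias = 0.5  # model "coefficient": baseline relevance

    def score(self, article_len):
        return self.bias  # a real model would use article features

    def retrain(self, feedback):
        # Feedback is a batch of 0/1 relevance labels, e.g. read from Kafka.
        if feedback:
            self.bias = sum(feedback) / len(feedback)

feedback_topic = deque([1, 1, 0, 1])  # stands in for a Kafka topic
model = NewsScorer()
batch = [feedback_topic.popleft() for _ in range(len(feedback_topic))]
model.retrain(batch)  # the periodic job: consume a batch, refit, redeploy
print(model.score(120))  # -> 0.75
```

The real system replaces the deque with Kafka, the refit with a Spark job, and the redeploy with an MLflow model registry update, but the control flow is the same.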
Distributed Models Over Distributed Data with MLflow, PySpark, and Pandas (Databricks)
Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?
In this talk I will show that while more data often improves DL models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, image, and video, more data does not significantly help in high-bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single-node models can still outperform distributed models via transfer learning.
Data scientists face several pain points: running many models in parallel, automating the experimental setup, and getting others (especially analysts) within an organization to use their models. Databricks addresses these problems using pandas UDFs, the ML Runtime, and MLflow.
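The "many models in parallel" pattern amounts to fitting one model per data group. A plain-pandas sketch of the shape of that code (toy data; fit_slope is a hypothetical per-group model, and in PySpark the same function could be handed to groupBy(...).applyInPandas to run in parallel across the cluster):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "x": [1.0, 2.0, 1.0, 2.0],
    "y": [2.1, 3.9, 1.2, 1.9],
})

def fit_slope(group: pd.DataFrame) -> pd.DataFrame:
    # One tiny "model" per group: least-squares slope of y on x.
    x, y = group["x"], group["y"]
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return pd.DataFrame({"store": [group["store"].iloc[0]], "slope": [slope]})

# In PySpark this function would run in parallel per group, e.g. via
# df.groupBy("store").applyInPandas(fit_slope, "store string, slope double")
result = pd.concat([fit_slope(g) for _, g in df.groupby("store")],
                   ignore_index=True)
print(result)
```

The per-group function stays ordinary pandas code; only the dispatch mechanism changes when moving from a laptop to Spark.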
We are at the dawn of digital businesses that are reimagined to make the best use of digital technologies such as automation, analytics, cloud, and integration. These businesses are efficient, continuously optimizing, proactive, flexible, and able to understand customers in detail. A key part of a digital business is analytics: the eyes and ears of the system that track and provide a detailed view of what was and what is, and let decision makers predict what will be.
This session will explore how the WSO2 analytics platform:
- Plays a role in your digital transformation journey
- Collects and analyzes data through batch, real-time, interactive, and predictive processing technologies
- Lets you communicate the results through dashboards
- Brings together all analytics technologies into a single platform and user experience
GraphLab Conference 2014: Rajat Arya - Deployment with GraphLab Create (Turi, Inc.)
This document discusses how GraphLab Create can be used to build reusable data pipelines for predictive analytics. It provides examples of how tasks like model training, recommendation generation, and result persistence can be modularized and executed together as workflows. Key benefits highlighted include portability of code across environments like Hadoop and EC2, ability to incrementally develop and monitor pipelines, and managing dependencies and configurations automatically.
Mastering Your Customer Data on Apache Spark by Elliott Cordo (Spark Summit)
This document discusses how Caserta Concepts used Apache Spark to help a customer master their customer data by cleaning, standardizing, matching, and linking over 6 million customer records and hundreds of millions of data points. Traditional customer data integration approaches were prohibitively expensive and slow for this volume of data. Spark enabled the data to be processed 10x faster by parallelizing data cleansing and transformation. GraphX was also used to model the data as a graph and identify linked customer records, reducing survivorship processing from 2 hours to under 5 minutes.
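The GraphX linking step - records as vertices, fuzzy matches as edges, linked customers as connected components - can be illustrated with a small union-find in Python (toy record ids; the actual project used Scala and GraphX):

```python
def connected_components(n_records, match_edges):
    """Union-find: group record ids linked by pairwise matches."""
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in match_edges:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Records 0-1-2 are the same customer (via fuzzy matches); 3-4 another; 5 alone.
print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Each resulting group is one "golden" customer; survivorship then picks the best field values within each group, which is the step GraphX cut from 2 hours to under 5 minutes.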
Retrieving Visually-Similar Products for Shopping Recommendations using Spark... (Databricks)
As an e-commerce company leading in fashion and lifestyle in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for customers. Using Spark, the data science team is able to develop various machine-learning projects that improve the shopping experience.
One such application is a service for retrieving visually similar products, which can then be used to show substitute products, to build visual recommenders, and to improve the overall recommendation system. In this project, Spark is used throughout the entire pipeline: retrieving and processing the image data, training models in a distributed fashion with TensorFlow, extracting image features, and computing similarity. In this talk, we are going to demonstrate how Spark and Databricks enable a small team to unify data and AI workflows, develop a pipeline for visual similarity, and train dedicated neural network models.
Production Ready Big ML Workflows from Zero to Hero - Daniel Marcous @ Waze (Ido Shilon)
This document provides an overview of production-ready machine learning workflows. It discusses challenges of big ML including skill gaps, dimensionality, and model complexity. The solution is presented as a workflow that includes preprocessing, naive implementation, monitoring with dashboards, optimization, A/B testing, and iteration. Key steps are to measure first before optimizing, start small and grow, test infrastructure, and establish a baseline before optimizing models. The document provides examples of applying these workflows at Waze for tasks like irregular traffic event detection, dangerous place identification, and speed limit inference.
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala... (Databricks)
Machine Learning is everywhere, but translating a data scientist’s model into an operational environment is challenging for many reasons. Models may need to be distributed to remote applications to generate predictions, or in the case of re-training, existing models may need to be updated or replaced. To monitor and diagnose such configurations requires tracking many variables (such as performance counters, models, ML algorithm specific statistics and more).
In this talk we will demonstrate how we have attacked this problem for a specific use case, edge based anomaly detection. We will show how Spark can be deployed in two types of environments (on edge nodes where the ML predictions can detect anomalies in real time, and on a cloud based cluster where new model coefficients can be computed on a larger collection of available data). To make this solution practically deployable, we have developed mechanisms to automatically update the edge prediction pipelines with new models, regularly retrain at the cloud instance, and gather metrics from all pipelines to monitor, diagnose and detect issues with the entire workflow. Using SparkML and Spark Accumulators, we have developed an ML pipeline framework capable of automating such deployments and a distributed application monitoring framework to aid in live monitoring.
The talk will describe the problems of operationalizing ML in an Edge context, our approaches to solving them and what we have learned, and include a live demo of our approach using anomaly detection ML algorithms in SparkML and others (clustering etc.) and live data feeds. All datasets and outputs will be made publicly available.
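A minimal sketch of the cloud/edge split described above, assuming a simple z-score detector whose "coefficients" (mean and standard deviation) are refit centrally on a larger batch and pushed to the edge (toy readings; the talk's real pipelines use SparkML):

```python
import math

def fit_coefficients(history):
    """'Cloud' side: recompute model coefficients on a larger data batch."""
    mean = sum(history) / len(history)
    var = sum((v - mean) ** 2 for v in history) / len(history)
    return mean, math.sqrt(var)

def is_anomaly(value, mean, std, threshold=3.0):
    """'Edge' side: score a reading in real time with the current model."""
    return abs(value - mean) > threshold * std

mean, std = fit_coefficients([10.0, 10.5, 9.5, 10.2, 9.8])  # cloud retrain
print(is_anomaly(10.1, mean, std))  # normal reading  -> False
print(is_anomaly(25.0, mean, std))  # anomalous reading -> True
```

The operational work the talk describes is everything around this split: shipping the refit (mean, std) pair to edge pipelines automatically and collecting metrics back from them.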
Autodeploy a complete end-to-end machine learning pipeline on Kubernetes using tools like Spark, TensorFlow, and HDFS; it requires a running Kubernetes (K8s) cluster in the cloud or on-premises.
The document discusses the challenges data scientists face in operationalizing big data projects and making the results accessible for broader organizational use. It argues that within the next 18 months, big data will become integrated into standard reporting and analysis used by all employees, not just data scientists. However, current tools like Hadoop are too slow for interactive work. New technologies are needed that provide massively parallel processing and tightly integrate with Hadoop, but also allow for use of existing reporting tools. This will require analytical platforms with in-memory processing capabilities and low latency.
Applied Machine Learning for Ranking Products in an Ecommerce Setting (Databricks)
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should the products be shown? Ranking products is also important when a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used across the whole pipeline: retrieving and processing the search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production using Elasticsearch, and creating Tableau dashboards. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline of ranking products and the challenges we faced along the way.
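One ingredient of click-model-based ranking is not over-ranking products with few impressions. A stdlib-only illustration of that idea using a smoothed click-through rate (the product counts and the prior values below are hypothetical, not Wehkamp's):

```python
def smoothed_ctr(clicks, views, prior_ctr=0.05, prior_weight=500):
    """Bayesian-smoothed CTR: shrink low-traffic items toward the prior."""
    return (clicks + prior_ctr * prior_weight) / (views + prior_weight)

products = {
    "shirt": (120, 2000),  # many views, solid CTR
    "scarf": (3, 10),      # tiny sample: raw CTR 0.30 is unreliable
    "jeans": (40, 1000),
}
ranking = sorted(products, key=lambda p: smoothed_ctr(*products[p]),
                 reverse=True)
print(ranking)  # -> ['shirt', 'scarf', 'jeans']
```

Without smoothing, the scarf's 3-out-of-10 raw CTR would put it first; the prior pulls it below the well-measured shirt while still ranking it above jeans.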
This document provides an overview of the skills, tools, and techniques needed for big data science. It discusses infrastructure requirements like Hadoop and NoSQL, as well as necessary talent and analytic capabilities. A case study is presented using data from Stack Overflow to demonstrate the end-to-end process of exploring data, building features, creating structured and unstructured models, and ensembling models to solve a business problem. The document emphasizes that achieving early success in big data science requires a blend of analysis and scripting skills along with an understanding of relevant techniques, but large teams of PhDs or major investments are not necessarily needed.
Use of standards and related issues in predictive analytics (Paco Nathan)
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
This document provides an overview of predictive modelling with Azure Machine Learning. It discusses trends in internet of things and big data that are driving growth in machine learning. It introduces machine learning concepts and how Azure ML can be used to build predictive models with strengths like a visual interface and support for collaborative work. The document outlines the Azure ML workflow from exploring data in the studio to operationalizing models with API services.
Machine Learning with Big Data using Apache Spark (InSemble)
"Machine Learning with Big Data using Apache Spark" was presented to the Lansing Big Data and Hadoop User Group by Muk Agaram and Amit Singh on 3/31/2015. It goes over the basics of machine learning and demos a use case of predicting recession using Apache Spark with Logistic Regression, SVM, and Random Forest algorithms.
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...) (Stefan Urbanek)
This keynote looks at some very common forces and threats that cause suffering in a data warehouse, shows examples of why the concepts are still relevant despite having all the high-end technology, and provides suggestions for starting with architecture and metadata.
What you need to know to start an AI company? (Mo Patel)
An overview of why AI and deep learning are hot now, and of machine intelligence startups. What are the key ingredients for an AI startup? How can AI startups compete with big tech companies, and which areas should they focus on for differentiation?
The More the Merrier: Scaling Model Building Infrastructure at Zendesk (Databricks)
A significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. Incorporating ML into real-world applications almost feels like "1% algorithm and 99% perspiration". I will share my team's experience in building 3 ML products at Zendesk, and discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration among product, engineering, and data science groups is imperative to strike the balance between model performance, scalability, and computational efficiency. The talk mainly focuses on scaling our model building infrastructure with an aim to build at least 50,000 models a day, as part of our efforts to deliver an ML product called Content Cues. In a nutshell, Content Cues summarizes text from customer support tickets to form insightful topics. It combines multiple ML algorithms including deep learning, clustering, and other natural language processing approaches, which are run over data from tens of thousands of eligible Zendesk customers every day. My talk will cover:
- How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch, and Kubernetes
- How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency
- Challenges in model monitoring, model version evolution, and capturing user feedback
Speaker: Wai Chee Yau
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We... (Sri Ambati)
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/xc3j20Om3UM
Description:
Data science is indeed one of the sexy jobs of the 21st century. But it is also a lot of hard work. And the hard work is seldom about the math or the algorithms. It is about building relevant machine learning products for the real world. We will go over some of the must-haves as you take your machine learning model out of the sandbox and make it work in the big, bad world outside.
Speaker's Bio:
Krish Swamy is an experienced professional with deep skills in applying analytics and BigData capabilities to challenging business problems and driving customer insights. Krish's analytic experience includes marketing and pricing, credit risk, digital analytics and most recently, big data analytics and data transformation. His key experiences lie in banking and financial services, the digital customer experience domain, with a background in management consulting. Other key skills include influencing organizational change towards a data and analytics driven culture, and building teams of analysts, statisticians and data scientists.
This document discusses recommendations and personalization at Rakuten. It notes that Rakuten has over 100 million users and handles over 40 million item views per day. Recommendation challenges include dealing with different languages, user behaviors, business areas, and aggregating data across services. Rakuten uses a member-based business model that connects its various services through a common Rakuten ID. The document outlines Rakuten's business-to-business-to-consumer model and how recommendations must handle many shops, item references, and a global catalog. It also provides an overview of Rakuten's recommendation system and some of the challenges in generating and ranking recommendation candidates.
1) Machine learning and predictive analytics can be used to analyze large datasets and build models to find useful insights, predict outcomes, and provide competitive advantages.
2) WSO2 Machine Learner is a product that allows users to upload data, train machine learning models using various algorithms, compare results, and iterate on models.
3) Example use cases demonstrated by WSO2 Machine Learner include predicting airport wait times, tracking people via Bluetooth, predicting the Super Bowl winner, detecting defective manufacturing equipment, and identifying promising customers.
Pandas UDF: Scalable Analysis with Python and PySpark (Li Jin)
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. Spark ships with a Python interface, aka PySpark; however, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem - the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds - the ability to define easy-to-use, high-performance UDFs and scale up your analysis with Spark.
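The gap a Vectorized UDF closes can be seen with pandas alone: the same arithmetic applied one row at a time versus once over a whole batch (in PySpark, the batch version is the kind of function pandas_udf wraps so that it receives Arrow-backed batches instead of single rows):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Row-at-a-time, the classic PySpark UDF model: one Python call per value.
per_row = s.apply(lambda v: v * 2 + 1)

# Vectorized, the pandas UDF model: one Python call for the whole batch,
# with the arithmetic done in optimized columnar code (what Arrow enables).
def double_plus_one(batch: pd.Series) -> pd.Series:
    return batch * 2 + 1

vectorized = double_plus_one(s)
print(vectorized.tolist())  # -> [3.0, 5.0, 7.0, 9.0]
```

Both produce identical results; the vectorized form avoids the per-row Python call overhead, which is where the PySpark performance gain comes from.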
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... (Rodney Joyce)
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python, and Spark ML.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
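As a flavor of the modeling steps in the series, here is the classic Titanic "gender baseline" that any trained model should beat, in plain Python (the passenger rows below are hypothetical, not the Kaggle data):

```python
# Toy passenger rows: (sex, survived). Values are made up for illustration.
passengers = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0),
]

def gender_baseline(sex):
    """Predict survival for females, non-survival for males."""
    return 1 if sex == "female" else 0

correct = sum(gender_baseline(sex) == survived for sex, survived in passengers)
accuracy = correct / len(passengers)
print(f"baseline accuracy: {accuracy:.2f}")
```

A Spark ML logistic regression or random forest in the later talks earns its keep only if it outperforms this one-feature rule.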
State of Play: Data Science on Hadoop in 2015 by Sean Owen at Big Data Spain... (Big Data Spain)
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine learning is not new. Big machine learning is qualitatively different: more data beats algorithm improvements, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Accelerating Production Machine Learning with MLflow with Matei Zaharia (Databricks)
Successfully building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with 45 contributors and new features including multiple language APIs, integrations with popular ML libraries, and storage backends. I’ll go through some of the newly released features and explain how to get started with MLflow.
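The core tracking idea - one record per run holding parameters and metrics, queryable later for the best run - can be mocked in a few lines of stdlib Python. This is a toy illustration of the concept only, not the MLflow API (MLflow exposes it via mlflow.start_run(), mlflow.log_param(), and mlflow.log_metric()):

```python
import json
import uuid

class Tracker:
    """Toy experiment tracker: one dict per run, like MLflow's run records."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = {"run_id": uuid.uuid4().hex, "params": params, "metrics": metrics}
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric):
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = Tracker()
tracker.log_run({"max_depth": 3}, {"auc": 0.81})
tracker.log_run({"max_depth": 7}, {"auc": 0.86})
best = tracker.best_run("auc")
print(json.dumps(best["params"]))  # -> {"max_depth": 7}
```

MLflow's value over a toy like this is exactly the hard part the talk covers: multi-user tracking, artifact storage backends, and deployment of the winning model.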
This document discusses auditing reactive applications to detect blocking API calls. It describes how blocking calls can negatively impact performance in reactive systems by consuming thread pools. Various techniques for detecting blocking calls are examined, including modifying the JDK, generating warnings during compilation, and instrumenting code at runtime using a JVM agent. Aspect-oriented programming is highlighted as a way to audit applications at load time by weaving in checks for over 500 blocking methods across many Java APIs. The reactive-audit tool is introduced as an open source project that helps developers test for blocking calls in frameworks like Play, Jetty, and Akka.
Retrieving Visually-Similar Products for Shopping Recommendations using Spark...Databricks
As an e-commerce company leading in fashion and lifestyle in the Netherlands, Wehkamp dedicates itself to provide a better shopping experience for customers. Using Spark, the data science team is able to develop various machine-learning projects that improve the shopping experience.
One of the applications is to create a service for retrieving visually similar products, which can then be used to show substitutional products, to build visual recommenders and to improve the overall recommendation system. In this project, Spark is used throughout the entire pipeline: retrieving and processing the image data, training model distributedly with Tensorflow, extracting image features, and computing similarity. In this talk, we are going to demonstrate how Spark and the Databricks enable a small team to unify data and AI workflows, develop a pipeline for visual similarity and train dedicated neural network models.
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
This document provides an overview of production-ready machine learning workflows. It discusses challenges of big ML including skill gaps, dimensionality, and model complexity. The solution is presented as a workflow that includes preprocessing, naive implementation, monitoring with dashboards, optimization, A/B testing, and iteration. Key steps are to measure first before optimizing, start small and grow, test infrastructure, and establish a baseline before optimizing models. The document provides examples of applying these workflows at Waze for tasks like irregular traffic event detection, dangerous place identification, and speed limit inference.
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Databricks
Machine Learning is everywhere, but translating a data scientist’s model into an operational environment is challenging for many reasons. Models may need to be distributed to remote applications to generate predictions, or in the case of re-training, existing models may need to be updated or replaced. To monitor and diagnose such configurations requires tracking many variables (such as performance counters, models, ML algorithm specific statistics and more).
In this talk we will demonstrate how we have attacked this problem for a specific use case, edge based anomaly detection. We will show how Spark can be deployed in two types of environments (on edge nodes where the ML predictions can detect anomalies in real time, and on a cloud based cluster where new model coefficients can be computed on a larger collection of available data). To make this solution practically deployable, we have developed mechanisms to automatically update the edge prediction pipelines with new models, regularly retrain at the cloud instance, and gather metrics from all pipelines to monitor, diagnose and detect issues with the entire workflow. Using SparkML and Spark Accumulators, we have developed an ML pipeline framework capable of automating such deployments and a distributed application monitoring framework to aid in live monitoring.
The talk will describe the problems of operationalizing ML in an Edge context, our approaches to solving them and what we have learned, and include a live demo of our approach using anomaly detection ML algorithms in SparkML and others (clustering etc.) and live data feeds. All datasets and outputs will be made publicly available.
Autodeploy a complete end-to-end machine learning pipeline on Kubernetes using tools like Spark, TensorFlow, HDFS, etc. - it requires a running Kubernetes (K8s) cluster in the cloud or on-premise.
The document discusses the challenges data scientists face in operationalizing big data projects and making the results accessible for broader organizational use. It argues that within the next 18 months, big data will become integrated into standard reporting and analysis used by all employees, not just data scientists. However, current tools like Hadoop are too slow for interactive work. New technologies are needed that provide massively parallel processing and tightly integrate with Hadoop, but also allow for use of existing reporting tools. This will require analytical platforms with in-memory processing capabilities and low latency.
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should they be shown? Ranking products is also important when a visitor opens a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, making click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch and creating Tableau dashboarding. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline of ranking products and the challenges we faced along the way.
This document provides an overview of the skills, tools, and techniques needed for big data science. It discusses infrastructure requirements like Hadoop and NoSQL, as well as necessary talent and analytic capabilities. A case study is presented using data from Stack Overflow to demonstrate the end-to-end process of exploring data, building features, creating structured and unstructured models, and ensembling models to solve a business problem. The document emphasizes that achieving early success in big data science requires a blend of analysis and scripting skills along with an understanding of relevant techniques, but large teams of PhDs or major investments are not necessarily needed.
Use of standards and related issues in predictive analyticsPaco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
This document provides an overview of predictive modelling with Azure Machine Learning. It discusses trends in internet of things and big data that are driving growth in machine learning. It introduces machine learning concepts and how Azure ML can be used to build predictive models with strengths like a visual interface and support for collaborative work. The document outlines the Azure ML workflow from exploring data in the studio to operationalizing models with API services.
Machine Learning with Big Data using Apache SparkInSemble
"Machine Learning with Big Data
using Apache Spark" was presented to the Lansing Big Data and Hadoop User Group by Muk Agaram and Amit Singh on 3/31/2015. It covers the basics of machine learning and demos a use case of predicting recessions using Apache Spark with Logistic Regression, SVM, and Random Forest algorithms.
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Stefan Urbanek
This keynote looks at some very common forces and threats that cause widespread suffering in a data warehouse. It shows examples of why the concepts are still relevant despite the availability of high-end technology, and provides suggestions for starting with architecture and metadata.
What you need to know to start an AI company?Mo Patel
An overview of why AI and deep learning are hot now, and of the machine intelligence startup landscape. What are the key ingredients for an AI startup? How can AI startups compete with big tech companies, and which areas should they focus on for differentiation?
The More the Merrier: Scaling Model Building Infrastructure at ZendeskDatabricks
A significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. Incorporating ML into real-world applications almost feels like "1% algorithm and 99% perspiration". I will share my team's experience in building 3 ML products at Zendesk, and discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration between different groups, including product, engineering and data science, is imperative to strike the balance between model performance, scalability and computational efficiency. The talk mainly focuses on scaling our model-building infrastructure with the aim of building at least 50,000 models a day, part of our effort to deliver an ML product called Content Cues. In a nutshell, Content Cues summarizes text from customer support tickets to form insightful topics. It combines multiple ML algorithms, including deep learning, clustering and other natural language processing approaches, which are run over data from tens of thousands of eligible Zendesk customers every day. My talk will cover the following topics:
* How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch and Kubernetes
* How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency
* Challenges in model monitoring, model versioning evolution and capturing user feedback
Speaker: Wai Chee Yau
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Sri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/xc3j20Om3UM
Description:
Data science is indeed one of the sexy jobs of the 21st century. But it is also a lot of hard work. And the hard work is seldom about the math or the algorithms. It is about building relevant machine learning products for the real world. We will go over some of the must-haves as you take your machine learning model out of the sandbox and make it work in the big, bad world outside.
Speaker's Bio:
Krish Swamy is an experienced professional with deep skills in applying analytics and big data capabilities to challenging business problems and driving customer insights. His analytic experience includes marketing and pricing, credit risk, digital analytics and, most recently, big data analytics and data transformation. His key experiences lie in banking and financial services and the digital customer experience domain, with a background in management consulting. Other key skills include influencing organizational change towards a data- and analytics-driven culture, and building teams of analysts, statisticians and data scientists.
This document discusses recommendations and personalization at Rakuten. It notes that Rakuten has over 100 million users and handles over 40 million item views per day. Recommendation challenges include dealing with different languages, user behaviors, business areas, and aggregating data across services. Rakuten uses a member-based business model that connects its various services through a common Rakuten ID. The document outlines Rakuten's business-to-business-to-consumer model and how recommendations must handle many shops, item references, and a global catalog. It also provides an overview of Rakuten's recommendation system and some of the challenges in generating and ranking recommendation candidates.
1) Machine learning and predictive analytics can be used to analyze large datasets and build models to find useful insights, predict outcomes, and provide competitive advantages.
2) WSO2 Machine Learner is a product that allows users to upload data, train machine learning models using various algorithms, compare results, and iterate on models.
3) Example use cases demonstrated by WSO2 Machine Learner include predicting airport wait times, tracking people via Bluetooth, predicting the Super Bowl winner, detecting defective manufacturing equipment, and identifying promising customers.
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. Spark ships with a Python interface, aka PySpark; however, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem – the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds – the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
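The performance gap the talk describes comes from calling a Python function once per row versus once per column batch. A minimal sketch of the idea using pandas alone (no Spark required; the function names here are illustrative, not Spark's API):

```python
import pandas as pd

# Row-at-a-time: the function is invoked once per value,
# paying Python call overhead for every single row.
def plus_one_scalar(x):
    return x + 1

# Vectorized: the function receives a whole pandas Series and
# returns a Series -- one call, column-wise arithmetic throughout.
def plus_one_vectorized(s: pd.Series) -> pd.Series:
    return s + 1

values = pd.Series(range(5))
row_result = values.apply(plus_one_scalar)   # slow path
vec_result = plus_one_vectorized(values)     # fast path
assert row_result.equals(vec_result)
print(vec_result.tolist())  # [1, 2, 3, 4, 5]
```

In PySpark, the vectorized variant would be registered with `pyspark.sql.functions.pandas_udf`, with Apache Arrow moving data between the JVM and Python in column batches rather than row by row.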
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, python and MLSpark.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine learning is not new, but big machine learning is qualitatively different: more data beats algorithmic improvement, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Accelerating Production Machine Learning with MLflow with Matei ZahariaDatabricks
Successfully building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I'll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with 45 contributors and new features including multiple language APIs, integrations with popular ML libraries, and storage backends. I'll go through some of the newly released features and explain how to get started with MLflow.
This document discusses auditing reactive applications to detect blocking API calls. It describes how blocking calls can negatively impact performance in reactive systems by consuming thread pools. Various techniques for detecting blocking calls are examined, including modifying the JDK, generating warnings during compilation, and instrumenting code at runtime using a JVM agent. Aspect-oriented programming is highlighted as a way to audit applications at load time by weaving in checks for over 500 blocking methods across many Java APIs. The reactive-audit tool is introduced as an open source project for helping developers test for blocking calls in frameworks like Play, Jetty, and Akka.
Spark and Mesos cluster optimization was discussed. The key points were:
1. Spark concepts like stages, tasks, and partitions were explained to understand application behavior and optimization opportunities around shuffling.
2. Application optimization focused on reducing shuffling through techniques like partitioning, reducing object sizes, and optimizing closures.
3. Memory tuning in Spark involved configuring storage and shuffling fractions to control memory usage between user data and Spark's internal data.
4. When running Spark on Mesos, coarse-grained and fine-grained allocation modes were described along with solutions like using Mesos roles to control resource allocation and dynamic allocation in coarse-grained mode.
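The shuffle-reduction techniques in point 2 rest on one property: a record's key deterministically picks its partition, so records with the same key always co-locate. A toy sketch of hash partitioning in plain Python (the hash function and partition count are illustrative; Spark's HashPartitioner uses the key's JVM hash):

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    # A stable hash modulo the partition count: the same key
    # always maps to the same partition index.
    return zlib.crc32(key.encode()) % num_partitions

def partition_records(records, num_partitions=4):
    # Bucket (key, value) records by their assigned partition.
    parts = defaultdict(list)
    for key, value in records:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts

records = [("user1", 10), ("user2", 5), ("user1", 7)]
parts = partition_records(records)

# Every record for "user1" lands in exactly one partition, so a
# later aggregation by key needs no cross-partition data movement.
owning = {p for p, recs in parts.items() for k, _ in recs if k == "user1"}
assert len(owning) == 1
```

Two datasets partitioned with the same partitioner and partition count can be joined without a shuffle, which is the optimization opportunity point 2 refers to.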
The Other 99% of a Data Science ProjectEugene Mandel
Slides from my talk at Open Data Science Conference 2016.
Algorithms and models are an important (and cool) part of data science. This talk is about all the other steps that it takes to deploy a data science project that makes a product slightly smarter. Stuff that you hear from practitioners, but is not covered well enough in books.
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.
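The core idea the document describes — specifying a multi-step workflow as one pipeline object whose stages are fit and then chained — can be sketched in a few lines of plain Python. All class names below are illustrative stand-ins; the real PySpark classes are `Pipeline`, `Tokenizer`, `HashingTF`, and so on:

```python
# Toy fit/transform pipeline in the spirit of Spark ML's Pipeline:
# each stage is fit on the data, transforms it, and hands the result
# to the next stage, so the whole workflow is one re-runnable object.

class Tokenizer:
    def fit(self, rows):
        return self
    def transform(self, rows):
        return [row.lower().split() for row in rows]

class CountVectorizer:
    def fit(self, docs):
        # Learn a vocabulary index from the tokenized documents.
        self.vocab = {}
        for tokens in docs:
            for t in tokens:
                self.vocab.setdefault(t, len(self.vocab))
        return self
    def transform(self, docs):
        vectors = []
        for tokens in docs:
            v = [0] * len(self.vocab)
            for t in tokens:
                if t in self.vocab:
                    v[self.vocab[t]] += 1
            vectors.append(v)
        return vectors

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Tokenizer(), CountVectorizer()])
features = pipe.fit_transform(["spark ml pipelines", "spark pipelines"])
print(features)  # [[1, 1, 1], [1, 0, 1]]
```

Because the pipeline is a single object, re-running on new data or sweeping parameters means re-fitting one thing, which is exactly the debugging and tuning benefit the document highlights.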
Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Ser...Kai Wähner
This document provides an overview of streaming analytics and compares different streaming analytics frameworks. It begins with real-world use cases in various industries and then defines what a data stream is. The core components of a streaming analytics processing pipeline are described, including ingestion, preprocessing, and real-time and batch processing. Popular open-source frameworks like Apache Storm and AWS Kinesis are highlighted. The document concludes by noting that both streaming analytics frameworks and products are growing significantly to enable real-time analytics on streaming data.
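The real-time processing stage of the pipeline described above typically reduces to windowed aggregation over a stream of timestamped events. A minimal stdlib sketch of a tumbling window (the event tuples and window size are made-up illustrative values):

```python
# Tumbling-window aggregation: each event carries a timestamp and a
# value; events are bucketed into fixed, non-overlapping windows and
# one aggregate is produced per window.

def tumbling_window_sums(events, window_size):
    """events: iterable of (timestamp, value); returns {window_start: sum}."""
    sums = {}
    for ts, value in events:
        window_start = ts - (ts % window_size)
        sums[window_start] = sums.get(window_start, 0) + value
    return sums

events = [(0, 1), (2, 3), (5, 10), (7, 2), (11, 4)]
print(tumbling_window_sums(events, window_size=5))
# {0: 4, 5: 12, 10: 4}
```

Frameworks like Storm or Kinesis Analytics add the hard parts this sketch ignores — out-of-order events, watermarks, and fault-tolerant state — but the per-window aggregation logic is the same shape.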
JavaFX 2 and Scala - Like Milk and Cookies (33rd Degrees)Stephen Chin
JavaFX 2.0 is the next version of a revolutionary rich client platform for developing immersive desktop applications. One of the new features in JavaFX 2.0 is a set of pure Java APIs that can be used from any JVM language, opening up tremendous possibilities. This presentation demonstrates the benefits of using JavaFX 2.0 together with the Scala programming language to provide a type-safe declarative syntax with support for lazy bindings and collections. Advanced language features, such as DelayedInit and @specialized will be discussed, as will ways of forcing prioritization of implicit conversions for n-level cases. Those who survive the pure technical geekiness of this talk will be rewarded with plenty of JavaFX UI eye candy.
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...Kai Wähner
This document provides an overview of how to apply big data analytics and machine learning to real-time processing. It discusses using machine learning and big data analytics to analyze historical data and build models. These models can then be used in real-time processing, without needing to be rebuilt, to take automated actions based on incoming data. The agenda includes sections on machine learning, analysis of historical data, real-time processing, and a live demo.
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
Parquet is a columnar storage format for Hadoop data. It was developed collaboratively by Twitter and Cloudera to address the need for efficient analytics on large datasets. Parquet provides more efficient compression and I/O compared to row-based formats by only reading and decompressing the columns needed by a query. It has been adopted by many companies for analytics workloads involving terabytes to petabytes of data. Parquet is language-independent and supports integration with frameworks like Hive, Pig, and Impala. It provides significant performance improvements and storage savings compared to traditional row-based formats.
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
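The two efficiency claims above — reading only the columns a query needs, and skipping data via statistics (predicate pushdown) — can be illustrated with a toy row-versus-column layout in plain Python (the table and filter are made-up examples, not Parquet's actual encoding):

```python
# The same small table in two layouts.
rows = [
    {"user": "a", "clicks": 3, "country": "NL"},
    {"user": "b", "clicks": 7, "country": "DE"},
    {"user": "c", "clicks": 1, "country": "NL"},
]

# Columnar: one array per column, as Parquet lays data out on disk.
columns = {
    "user": ["a", "b", "c"],
    "clicks": [3, 7, 1],
    "country": ["NL", "DE", "NL"],
}

# A query like SELECT sum(clicks) must touch every field of every
# record in the row layout, but only the "clicks" array in the
# columnar one -- less I/O and less decompression.
row_total = sum(r["clicks"] for r in rows)   # scans all fields
col_total = sum(columns["clicks"])           # scans one column
assert row_total == col_total == 11

# Predicate pushdown sketch: per-chunk min/max statistics let a
# reader skip whole chunks that cannot satisfy a filter.
chunk_stats = {"min": min(columns["clicks"]), "max": max(columns["clicks"])}
can_skip_for_gt_5 = chunk_stats["max"] <= 5  # chunk has a 7, so no
print(col_total, can_skip_for_gt_5)  # 11 False
```

Columnar layout also compresses better because values within one column are similar, which is where the benchmark gains cited above come from.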
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
Apache Kafka is a distributed streaming platform that allows for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging system with persistence that allows for building real-time streaming applications. Producers publish data to topics which are divided into partitions. Consumers subscribe to topics and process the streaming data. The system handles scaling and data distribution to allow for high throughput and fault tolerance.
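The topic/partition model described above has one key consequence: a keyed message always lands in the same partition, preserving per-key ordering while still allowing parallel consumers. A toy in-memory model of that layout (not Kafka's API; the class, keys, and hash choice are illustrative):

```python
import zlib

# Toy model of Kafka's data layout: a topic is a set of append-only
# partitions; the producer hashes the message key to pick a partition,
# so all messages for one key form an ordered log within it.

class Topic:
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> int:
        idx = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[idx].append((key, value))
        return idx

topic = Topic()
p1 = topic.produce("order-42", "created")
p2 = topic.produce("order-42", "paid")
assert p1 == p2  # same key -> same partition -> ordered per key

# A consumer reads one partition sequentially, by offset:
history = [v for _, v in topic.partitions[p1]]
print(history)  # ['created', 'paid']
```

Scaling throughput then means adding partitions and consumers, since each partition can be consumed independently — the fault-tolerance and throughput property the document describes.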
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsKai Wähner
Slides from my talk at Codemotion Rome in March 2017. Development of analytic machine learning / deep learning models with R, Apache Spark ML, TensorFlow, H2O.ai, RapidMiner, KNIME and TIBCO Spotfire. Deployment to real-time event processing / stream processing / streaming analytics engines like Apache Spark Streaming, Apache Flink, Kafka Streams and TIBCO StreamBase.
This document summarizes various projects from different industries including web development, online business, electronics, home appliances, automotive, tourism, consulting, movie festivals, education, food processing, telecommunications, finance, and travel. It provides brief descriptions and key metrics for each project related to website development, social media, marketing, software, and technology solutions.
Recreation is important for people's balance and well-being. It provides fun and relief from the stress associated with work responsibilities and other obligations. There are different types of recreation, such as sports, the arts, and outdoor life. Recreation has physical, mental, and social benefits, such as improving health, reducing stress, and fostering cooperation among people.
This document provides information about an assignment for an MBA course on Internal Audit and Control. It includes 6 questions related to distinguishing between types of audits, similarities and differences between internal and external audits, quality control policies for audit firms, principles of internal control, problems with electronic data processing related to internal control, and factors for an effective internal control system in a bank. Students are to answer the questions in approximately 400 words each for a total of 60 marks. The assignment can be purchased by emailing or calling the provided contact information for Rs. 125 per question.
This document discusses artificial intelligence and machine learning. It provides a brief history of AI from the Perceptron model in 1958 to modern deep learning approaches. It then discusses several applications of machine learning like image classification, medical diagnosis, and autonomous vehicles. It also discusses challenges like distributed machine learning and hidden technical debt. Finally, it provides examples of how AI can be applied to commerce and automotive use cases.
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...Amazon Web Services
Customers are adopting Apache Spark ‒ an open-source distributed processing framework ‒ on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, a set of machine learning algorithms included with Spark, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts, and can access data from Amazon S3, Amazon Kinesis, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin and Jupyter notebooks, and use the Spark ML pipelines API to manage your training workflow. In addition, Jasjeet Thind, Senior Director of Data Science and Engineering at Zillow Group, will discuss his organization's development of personalization algorithms and platforms at scale using Spark on Amazon EMR.
Mastering MapReduce: MapReduce for Big Data Management and AnalysisTeradata Aster
Whether you’ve heard of Google’s MapReduce or not, its impact on big data applications, data warehousing, ETL, business intelligence, and data mining is re-shaping the market for business analytics and data processing.
Attend this session to hear from Curt Monash on the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.
In this session you will learn:
* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce into your own BI and data warehousing environment
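The basics covered in the first bullet boil down to three phases: map emits (key, value) pairs, the shuffle groups them by key, and reduce folds each group into a result. The canonical word-count example, as a self-contained sketch in plain Python:

```python
from collections import defaultdict

# The MapReduce skeleton: in a real cluster, map and reduce run in
# parallel across machines and the shuffle moves data between them;
# here all three phases run locally to show the data flow.

def map_phase(documents):
    # Emit (word, 1) for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Fold each key's values into a single count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big analytics"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'analytics': 1}
```

SQL-MapReduce, as discussed in the session, lets functions with this map/reduce shape be invoked from within SQL queries rather than as standalone jobs.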
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
Presented the hands-on session on “Introduction to Big Data Analysis” at Dayananda Sagar University. Around 150+ University students benefitted from this session.
Presented at IDEAS SoCal on Oct 20, 2018. I discuss main approaches of deploying data science engines to production and provide sample code for the comprehensive approach of real time scoring with MLeap and Spark ML.
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...Amazon Web Services
FINRA’s analytics platform unlocks the value in capital markets data by accelerating trade analytics and providing a foundation for machine learning at scale. The platform enables FINRA’s analysts to perform discovery on petabytes of trade data to identify instances of potential fraud, market manipulation, and insider trading. By centralizing all data in S3, FINRA’s architecture offers improved agility, scalability, and cost effectiveness. Analytics services such as Amazon EMR and Amazon Redshift have freed FINRA’s data scientists from the constraints of desktop tools, allowing them to apply machine learning techniques to develop and test new surveillance patterns. All of this is done while meeting FINRA’s security and compliance responsibilities as a financial regulator. At the end of this session, you’ll have an understanding of how to apply FINRA’s architecture to trade analytics and other financial services use cases, including meeting regulatory requirements such as the Consolidated Audit Trail (CAT) reporting.
Sanmitra Ijeri is a second-year Master's student in Computer Science specializing in Machine Learning at UC San Diego. During an internship at Salesforce, she built prototypes for lead-scoring and opportunity-scoring models using algorithms like Naive Bayes, logistic regression, n-grams, and neural networks. Previously, she worked as a Senior Software Developer at D. E. Shaw & Co., where she developed various applications and tools using technologies like Python, Java, and machine learning algorithms.
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
With Hadoop, we can easily process data from disk, but this consumes a lot of time. The value of certain insights, such as traffic alerts or heart-attack alerts, degrades with time, and handling this time-sensitive data needs real-time technologies that can produce output within milliseconds. Moreover, some use cases need advanced analytics like machine learning.
In this talk, we will discuss the WSO2 Data Analytics Platform, which brings all of these technologies together in one platform. It lets you collect data through a single sensor API, process it using batch, real-time or predictive technologies, and communicate your results, all within a single platform and user experience.
Presenter:
Srinath Perera
Vice President – Research,
WSO2
This document is a resume for Yu Wang, who is pursuing an MS in Computer Science from UT Dallas with a 3.35 GPA. Wang has experience in web development, big data, databases, and programming languages like Java, C#, Python, R and SQL. He is looking for a summer/fall 2016 internship in computer science. Some of Wang's projects include developing predictive models using Spark and machine learning algorithms, building web applications using ASP.NET and AngularJS, and performing data analysis on large datasets with tools like Hadoop, Pig, and Hive.
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up to date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, the model’s parameters are pushed to the streaming event processing layer, implemented in Akka. The Akka layer then scores thousands of events per second according to the last model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up to date, while Akka detects new anomalies by using the latest Spark-generated data model. The project is currently hosted on GitHub. Have a look at: http://coral-streaming.github.io
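The split the Coral architecture describes — a batch layer that trains a k-means model and a streaming layer that only scores against it — leaves the streaming side with very little work per event. A sketch of that scoring side in plain Python (the centroids and threshold are made-up illustrative values standing in for the parameters the batch layer would publish):

```python
import math

# Scoring side of a train-offline / score-online anomaly detector:
# the batch layer periodically publishes cluster centroids and a
# distance threshold; the streaming layer flags any event that is
# far from every centroid.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_anomaly(event, centroids, threshold):
    # An event is anomalous if no learned cluster is close to it.
    return min(distance(event, c) for c in centroids) > threshold

centroids = [(0.0, 0.0), (10.0, 10.0)]  # published by the batch layer
threshold = 3.0

print(is_anomaly((0.5, 1.0), centroids, threshold))  # False: near a cluster
print(is_anomaly((5.0, 5.0), centroids, threshold))  # True: far from both
```

Because scoring is just a few distance computations, the streaming layer can sustain a high event rate, while model freshness is handled entirely by the batch layer republishing new centroids.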
Building an AI-Powered Retail Experience with Delta Lake, Spark, and DatabricksDatabricks
Zalando SE is Europe’s leading online fashion platform and connects customers, brands and partners. With millions of visitors each month, we have petabytes of purchase, click-stream, product and other data in our data lake. This data is crucial to powering insights on shopper behavior and driving an AI-first strategy to improve site engagement.
Over 7 months ago, Zalando adopted Apache Spark, Delta Lake and Databricks as its de facto computation platform for analytics and machine learning. During this period, we onboarded well over 50 internal teams, ranging from BI teams with no knowledge of Spark or big data running ETL pipelines, to AI/ML teams already using EMR and Spark for heavy model training. Given the spectrum of varied business problems they were trying to solve, we worked with each team individually, understanding their use cases, helping them validate assumptions, developing working code and taking them to production. In this talk we will share best practices for building a unified data and analytics architecture on Databricks, lessons learned rolling it out across the organization, and a deep dive on AI and analytics use cases in the fashion e-commerce space.
Connecting ML and online services takes effort and can't be done successfully without cross-functional work between data science (pushing the innovation envelope) and software engineering practices.
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoAmazon Web Services
The document provides an overview of Amazon Web Services (AWS) Elastic MapReduce (EMR) capabilities. It discusses how EMR allows customers to process vast amounts of data using Hadoop/Spark clusters in AWS without having to stand up and manage their own hardware. Examples are given of how companies like Netflix, Foursquare, and Anthropic use EMR for big data processing tasks like recommendations, analytics, and machine learning. The document highlights benefits of EMR like ease of use, flexibility, and cost savings compared to on-premises clusters.
Low Code Platform To Build Data & AI ProductsGramener
Gramener's CEO, Anand S conducted this webinar where he explained how to build Data and AI products using a low-code platform in less than two weeks.
A few takeaways:
- How low-code approaches can be tailored to your data/digital needs
- Decisions on building vs. buying
- Production-ready use cases to stimulate your thinking
Who should watch?
You will find this webinar valuable if you're a CPO or VP of IT, handling product development, or building analytical solutions for your company.
Watch this full webinar on: https://info.gramener.com/low-code-platform-to-build-process-optimization-solutions?
Want to know more about our low-code platform, Gramex?
Visit: https://gramener.com/gramex/
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleSaurabh Verma
This document summarizes a company's transition from a SQL database to a native graph database to power their identity resolution product. It describes the requirements of high read and write throughput and complex queries over billions of identities and linkages. It then outlines the evaluation of several graph databases, with JanusGraph on ScyllaDB performing the best. Key findings from prototyping include handling high query volume, managing supernodes, and tuning compaction strategies. The production implementation and architecture is also summarized.
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleScyllaDB
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs and growing. In their presentation, Zeotap engineers will delve into data access patterns, processing and storage requirements to make a case for a graph-based store. They will share the results of PoCs made on technologies such as Dgraph, OrientDB, Aerospike and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required for the production setup, configuration and performance tuning to manage data at this scale.
The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience, and to strengthen their business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing, aka Big Data, field.
The document provides an overview of an introductory course on artificial intelligence (AI), machine learning (ML), and deep learning (DL). Some key details include:
- The course, titled AI (Machine Learning / Deep Learning), runs for 6 months.
- The course aims to provide employable skills in AI programming, data science, deep learning, computer vision, natural language processing, and ML operations.
- Learning outcomes cover topics like AI fundamentals, data analytics, deep learning, computer vision, natural language processing, and core skills.
- The course prepares students for jobs like Python developer, data analyst, machine learning engineer, and more.
This document discusses interactive analytics for human timescales using feature sequences to calculate non-additive metrics like instant overlaps between large user groups. It describes Yahoo's advertising data warehouse that handles petabytes of data daily and provides normalized views and analytics across systems in milliseconds. Custom algorithms like feature sequence encoding enable exact overlap calculations in under a minute for billions of user events, compared to 19 hours for existing approaches.
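The overlap metric described above is non-additive: the overlap of two user groups cannot be derived from per-group totals, so it must be computed over the actual id sets. A stdlib-only Scala sketch of the core idea (the talk's feature-sequence encoding is more elaborate; the user ids below are invented):

```scala
import scala.collection.immutable.BitSet

// Each user group is a set of user ids; the exact overlap is the size
// of the intersection of the two bitsets.
val groupA = BitSet(1, 2, 3, 5, 8)
val groupB = BitSet(2, 3, 5, 7, 11)

val overlap = (groupA & groupB).size // users present in both groups
val union   = (groupA | groupB).size // non-additive: union size != |A| + |B|
```

Bitset intersections make exact (rather than estimated) overlaps cheap, which is the property the custom encoding exploits at much larger scale.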
Two decades ago Extreme Programming revolutionized software development with a set of principles and practices that help to improve product quality, user experience, efficiency and well-being of teams. In this presentation we will discuss how such a methodology can be even more important to deliver valuable and reliable Data Science products meeting ever-growing speed-to-market expectations.
Twenty years ago, Extreme Programming was an innovative framework with software engineering practices without which we could no longer imagine producing quality software today. In this presentation, we will discover how the practices of Extreme Data Science, standing on the shoulders of the giant that is Extreme Programming, let us successfully integrate data scientists and their projects into teams, and help ensure the quality of data science deliverables that offer optimal functionality for the user.
Make Data Science Great Again. Pourquoi et comment crafter la Data Science su...Anastasia Bobyreva
It is not easy to integrate Data Science into companies whose business was not originally built around artificial intelligence (AI), and for which AI is not at the heart of the trade. Despite the motivation to use AI, many Data Science projects in these companies fail.
This is as frustrating for business leaders as it is demotivating for data scientists, whose projects end up shelved. Together, we will analyze this situation to determine the reasons for these failures. We will also study how to avoid the most common mistakes, and how to carry out this change smoothly in order to enrich your products with AI.
The goal of this talk is that, whatever your profile (front-end dev, back-end dev, data scientist, CTO, CEO, Product Manager), you will go back to your company on Monday knowing how to both identify and successfully carry out Data Science opportunities.
Presentation of Learn Link, the first social network that aims to connect people based on what they want to learn or teach, and to boost motivation during learning.
https://twitter.com/swmtp/status/1005849400466464768
Thanks to my great teammates (slide 14) for their work and motivation !
Google voice transcriptions demystified: Introduction to recurrent neural ne...Anastasia Bobyreva
Introduction to LSTM, the deep learning algorithm behind Google Voice transcriptions, explained without any mathematical equations. Aimed mostly at a non-technical audience without any data science background.
Big Data Science in Scala ( Joker 2017, slides in Russian)Anastasia Bobyreva
"You have to run as fast as you can just to stay in place; to get anywhere, you have to run at least twice as fast!" - a data scientist in Wonderland.
Data science has to at least keep pace with the ever-growing volume and complexity of data, and ideally should try to stay ahead of it and anticipate the potential problems that arise while processing it.
In this talk you will see how the Scala libraries Saddle, Smile and Spark help data science meet the constantly evolving requirements of its infrastructure, easing analysis and extending the capabilities of descriptive statistics, data processing and machine learning. They are helped in this by the functional aspects of the Scala language, its favorable big data ecosystem, and its hybrid object-oriented nature.
Using click prediction on online advertising spaces as an example, we will explore the capabilities, advantages and development paths of Scala for data science.
How to get the best of both worlds : Big Data and Data Science?
Run deep learning on Spark easily with the BigDL library!
Slides of my short conference, introduction to BigDL, for Christmas JUG event in Montpellier
Which library should you choose for data-science? That's the question!Anastasia Bobyreva
This talk presents the data science ecosystem in two languages: Python and Scala. It demonstrates the use of their libraries on a real dataset to solve a binary classification problem with a decision tree algorithm.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with Prasad, NJ Gen AI Meetup Lead, and Procure.FYI's Co-Founder.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
7. Problem:
Optimize the click rate of delivered ads.
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
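A minimal, stdlib-only Scala sketch of such an estimator: a logistic model that maps one ad request's feature vector to a click probability. The feature values and weights here are invented for illustration; the talk itself uses trained models rather than hand-set weights.

```scala
// Logistic scoring: P(click) = sigmoid(w . x + b)
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def clickProbability(features: Vector[Double],
                     weights: Vector[Double],
                     bias: Double): Double =
  sigmoid(features.zip(weights).map { case (x, w) => x * w }.sum + bias)

val features = Vector(3.0, 6.0, 1.0)  // encoded Os, MaxPrice, Time
val weights  = Vector(0.4, -0.1, 0.2) // invented model weights
val p = clickProbability(features, weights, 0.0)
```

Whatever the model, the output is always a probability in (0, 1), which is what the ad platform ranks and thresholds on.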
10.
Os            MaxPrice  Time
Android       7.3       2016-06-09T0:25:28Z
iOS           4.55      2016-05-09T14:23:12Z
WindowsPhone  2.89      2016-06-09T11:35:11Z
14.
Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False
15.
Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Encoded as numeric features:
Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0
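The jump from strings and timestamps to the numeric table above is a categorical-encoding step. A stdlib-only sketch of what a StringIndexer-style encoder does (here indices follow order of first appearance; Spark ML's StringIndexer orders labels by frequency by default):

```scala
// Map each distinct categorical value in a column to a numeric index.
def indexColumn(values: Seq[String]): (Seq[Double], Map[String, Double]) = {
  val mapping = values.distinct.zipWithIndex
    .map { case (v, i) => v -> i.toDouble }.toMap
  (values.map(mapping), mapping)
}

val (encoded, mapping) = indexColumn(Seq("Android", "iOS", "WindowsPhone"))
// encoded: Seq(0.0, 1.0, 2.0)
```

The mapping is kept so the same encoding can be re-applied to new data at prediction time.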
17. Preprocessing: Spark ML
Extraction: extracting features from "raw" data (TF-IDF, Spark SQL)
Transformation: scaling, converting, or modifying features (Bucketizer, StringIndexer, IndexToString, VectorAssembler)
Selection: selecting a subset from a larger set of features (ChiSqSelector)
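To make one of these transformers concrete, here is a stdlib-only sketch of the idea behind Bucketizer: a continuous value is mapped to the index of the bucket whose split range contains it. The split points are invented, and edge handling in the real Bucketizer is stricter (out-of-range values raise an error unless configured otherwise):

```scala
// splits define buckets [s0, s1), [s1, s2), ...; return the bucket index.
def bucketize(splits: Vector[Double])(x: Double): Double = {
  val idx = splits.indexWhere(s => x < s) - 1
  // out-of-range values are clamped here; the real Bucketizer is stricter
  if (idx < 0) (splits.length - 2).toDouble else idx.toDouble
}

val priceBucket: Double => Double = bucketize(Vector(0.0, 2.5, 5.0, 10.0))
priceBucket(4.55) // MaxPrice 4.55 falls in bucket 1, i.e. [2.5, 5.0)
```

Bucketizing turns a continuous column like MaxPrice into a small categorical one, which tree-based models can split on cheaply.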
18. Preprocessing: Saddle
Array-backed, specialized data structures with Pandas-like operations:
● dealing with missing values
● index transformation tools
● extracting, slicing, mapping (row/column-wise)
● groupBy/join/concat
● sorting/pivoting
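As a flavor of the groupBy/aggregate pattern listed above, here is a plain-Scala sketch (Saddle's actual API works on Frame/Series objects; the rows and column names below are invented):

```scala
// Per-OS click-through rate, analogous to groupBy + aggregate on a Frame.
final case class Row(os: String, maxPrice: Double, clicked: Boolean)

val rows = Seq(
  Row("Android", 7.30, clicked = false),
  Row("iOS", 4.55, clicked = true),
  Row("Android", 2.89, clicked = false)
)

val ctrByOs: Map[String, Double] =
  rows.groupBy(_.os).map { case (os, rs) =>
    os -> rs.count(_.clicked).toDouble / rs.size
  }
```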
20. Learning: Spark ML
● DataFrame-based API
● Pipeline interface
● Classification
● Regression
● Linear methods
● Decision trees
● Tree ensembles
Pipeline: TF-IDF → StringIndexer → Assembler → Random Forest → Evaluation
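The stage chain on the last line reads naturally as function composition: each stage transforms the data and hands it on. A stdlib-only sketch of that chaining (the Ad type, stage bodies, and the toy "model" are all invented; in Spark ML each stage would be a Transformer or Estimator inside a Pipeline):

```scala
final case class Ad(os: String, maxPrice: Double)

val osIndex = Map("Android" -> 0.0, "iOS" -> 1.0, "WindowsPhone" -> 2.0)

// Stages as plain functions, chained with andThen like Pipeline stages.
val indexStage: Ad => (Double, Double) =
  ad => (osIndex(ad.os), ad.maxPrice)
val assembleStage: ((Double, Double)) => Vector[Double] = {
  case (i, p) => Vector(i, p)
}
val modelStage: Vector[Double] => Double =
  v => if (v(1) > 4.0) 0.7 else 0.2 // stand-in for a trained random forest

val pipeline: Ad => Double =
  indexStage andThen assembleStage andThen modelStage
```

Spark ML's Pipeline packages exactly this chaining, plus fitting: `pipeline.fit(train)` trains the estimator stages and returns a model that applies the whole chain.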