In this Strata 2018 presentation, Ted Malaska and Mark Grover discuss how to make the most of big data at speed.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/72396
Near real-time anomaly detection at Lyft (markgrover)
Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
Gender Prediction with Databricks AutoML Pipeline (Databricks)
As the nation’s leading advocate for people aged 50+, each month AARP conducts thousands of campaigns comprising hundreds of millions of emails, mailings, and phone calls to over 37 million members and a broader universe of non-members. Missing demographic information results in less accurate profiling and targeting strategies. For example, 1.5 million active members and 15 million expired members are missing gender information in AARP’s database. The name-gender model is a use case where the AARP Data Analytics team utilized the Databricks Lakehouse platform to create a fully automated machine learning model. The Random Forest classifier used 800 thousand existing distinct first names, ages, and over 700 variables derived from the letter composition of first names to predict gender. It leveraged MLflow to track model accuracy and log metrics over time, and registered multiple model versions to pick the best for production. Model training and scoring were scheduled, and the AutoML pipeline significantly reduced manual working hours after the initial setup. As a result, AARP dramatically (and accurately) improved the coverage of gender information from about 92.5% to 99.5%.
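The letter-composition features described above can be illustrated with a small sketch. The function name and the specific features below are hypothetical examples of that general kind of variable, not AARP's actual 700-variable feature set:

```python
def name_features(first_name):
    """Hypothetical letter-composition features for a first name.

    Illustrates the kind of variables described above; the actual
    AARP feature set (700+ variables) is not public."""
    name = first_name.strip().lower()
    vowels = set("aeiou")
    feats = {
        "length": len(name),
        "last_letter": name[-1],
        "last_two": name[-2:],
        "ends_in_vowel": int(name[-1] in vowels),
        "vowel_count": sum(c in vowels for c in name),
    }
    # presence indicator for each letter of the alphabet
    for c in "abcdefghijklmnopqrstuvwxyz":
        feats["has_" + c] = int(c in name)
    return feats

features = name_features("Maria")
# e.g. features["last_letter"] == "a", features["vowel_count"] == 3
```

Feature dictionaries like this can then be vectorized (for instance with one-hot encoding of the categorical entries) and fed to a random forest classifier alongside age.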
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ... (Data Con LA)
During my time working on attribution and ingest systems, I've encountered several different approaches to solving the simple question: "How do I get data from A to B?" In this session, I'd like to share some of the problems I've encountered and how to solve them effectively.
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big... (Grokking VN)
- Speaker: Hervé Vũ Roussel - CEO & Co-founder @ QuodAI
- About the speaker: Hervé Vũ Roussel was previously the CTO of a software company in Silicon Valley, USA. He has served as an advisor and mentor for organizations such as IBM AI XPRIZE, PlatoHQ (YC'16), RMIT, and AngelHack, is a frequent speaker on AI and software engineering, and has advised many universities and companies on computer science and software engineering curricula. He is currently the CEO of Quod AI, a platform that explains source code in natural language.
In this talk, he shares his experience designing a highly scalable architecture for AI platforms, covering:
- Foundational principles of software architecture
- How to choose data storage technologies
- Building asynchronous data pipelines
Zipline - A Declarative Feature Engineering Framework (Databricks)
Zipline is Airbnb’s data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks.
Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers ... (Databricks)
Personalized product recommendation, cross-sell selection, customer churn and purchase prediction are becoming more and more important for e-commerce companies such as JD.com, one of the world’s largest B2C online retailers with more than 300 million active users. Apache Spark is a powerful tool as a standard framework/API for machine learning feature generation, model training and model evaluation. Thus, JD machine learning engineers can easily experiment with ML pipelines and rapidly adjust models to meet demand.
Spark MLlib provides many useful machine learning algorithms with high-performance implementations. However, there is still a gap between our requirements and the existing MLlib algorithms. We have also implemented and open-sourced some algorithms, such as a new gradient boosting machine (GBM) implementation named SparkGBM, which incorporates valuable lessons learned from XGBoost and LightGBM. Under the Spark MLlib pipeline framework, we can plug our algorithms into the whole ML pipeline and benefit from the unified machine learning platform. In this talk, we will discuss how we use Apache Spark to prototype machine learning models for user segmentation, targeting and personalization at our group, the revenue growth it brings, and the lessons we learned.
We’ll also share the various pitfalls we have encountered when using Apache Spark, and how we collaborated with the community to overcome them.
No REST till Production – Building and Deploying 9 Models to Production in 3 ... (Databricks)
The state of the art in productionizing machine learning models today primarily addresses building RESTful APIs. In the digital ecosystem, RESTful APIs are a necessary, but not sufficient, part of the complete solution for productionizing ML models. And according to recent research by the McKinsey Global Institute, applying AI in marketing and sales has the most potential value.
In the digital ecosystem, productionizing ML models at an accelerated pace becomes easy with:
- A Feature Store of commonly used features that is available to all data scientists
- Feature Stores that distill visitor behavior into ready-to-use feature vectors in a semi-supervised manner
- A data pipeline that can support the challenging demands of the digital ecosystem to feed the Feature Store on an ongoing basis
- Pipeline templates that support the challenging demands of the digital ecosystem, feeding the feature store, predicting, and distributing predictions on an ongoing basis. With these, a major electronics manufacturer was able to build and productionize a new model in 3 weeks.
The use case for the model is retargeting advertising; it analyzes the behavior of website visitors and builds customized audiences of the visitors most likely to purchase 9 different products. Using the model, this manufacturer was able to maintain the same level of purchases with half the retargeting media spend, increasing the efficiency of their marketing spend by 100%.
A Production Quality Sketching Library for the Analysis of Big Data (Databricks)
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.... (Databricks)
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do you deploy these ML models to a production environment? How do you embed what you’ve learned into customer-facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
“Machine Learning in Production + Case Studies” by Dmitrijs Lvovs from Epista... (DevClub_lv)
Epistatica is a data science spin-off from VIA SMS R&D SERVICES, searching for its niche in European markets.
Dmitrijs is head of credit risk at VIA SMS R&D SERVICES, a fintech company, and a member of the board at Epistatica. He holds a PhD from the RAS Institute for Information Transmission Problems and has analyzed data for over 12 years.
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ... (Spark Summit)
This talk will cover the tools we used, the hurdles we faced and the workarounds we developed, with help from Databricks support, in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline to solve an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high dimensional labels and relatively low dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we have come across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced and walk through the key steps that made our project successful.
Learn Like a Human: Taking Machine Learning from Batch to Real-Time (Dynamic Yield)
Elad Rosenheim, our Chief Software Architect, gave a talk about machine learning and conversion optimization at eXelate's "Big Data eXposed 2015" event in Tel-Aviv. In his presentation, Elad touched on some of the most common problems encountered when trying to optimize websites for better conversion rates, and covered a wide range of solutions - starting with classic A/B testing (unsuitable for rapidly changing content), through Multi-arm Bandits, and finally Contextual Bandits – those online machine learning algorithms that allow for true personalization in real time.
Sample code: https://github.com/davegautam/dotnetconfsamplecodes
A presentation on how to get started with ML.NET. If you are an existing .NET stack developer and want to use the same technology for machine learning, this deck focuses on how you can use ML.NET.
Recent Gartner and Capgemini studies predict only around 25% of data science projects are successful and only around 15% make it to full-scale production. Of these, many degrade in performance and produce disappointing results within months of implementation. How can focusing on the desired business outcomes and business use cases throughout a data science project help overcome the odds?
Horizon: Deep Reinforcement Learning at Scale (Databricks)
To build a decision-making system, we must provide answers to two sets of questions: (1) "What will happen if I make decision X?" and (2) "How should I pick which decision to make?"
Typically, the first set of questions is answered with supervised learning: we build models to forecast whether someone will click on an ad, or visit a post. The second set of questions is more open-ended. In this talk, we will dive into how we can answer "how" questions, starting with heuristics and search. This will lead us to bandits, reinforcement learning, and Horizon: an open-source platform for training and deploying reinforcement learning models at massive scale. At Facebook, we are using Horizon, built using PyTorch 1.0 and Apache Spark, in a variety of AI-related and control tasks, spanning recommender systems, marketing & promotion distribution, and bandwidth optimization.
The talk will cover the key components of Horizon and the lessons we learned along the way that influenced the development of the platform.
Author: Jason Gauci
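The progression from heuristics to bandits described in the talk can be illustrated with a minimal epsilon-greedy bandit. This is a generic sketch with invented arm rewards and parameters, not Horizon's actual implementation:

```python
import random

def epsilon_greedy(true_rates, steps=10000, epsilon=0.1, seed=42):
    """Minimal epsilon-greedy bandit over simulated Bernoulli arms.

    With probability epsilon we explore a random arm; otherwise we
    exploit the arm with the highest estimated reward."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n       # pulls per arm
    estimates = [0.0] * n  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # incremental mean update for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy([0.05, 0.10, 0.20])
# the highest-rate arm ends up pulled most often
```

Contextual bandits and full reinforcement learning generalize this idea by conditioning the choice on observed state, which is where platforms like Horizon come in.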
Machine learning has become an important tool in the modern software toolbox, and high-performing organizations are increasingly coming to rely on data science and machine learning as a core part of their business. eBay introduced machine learning to its commerce search ranking and drove double-digit increases in revenue. Stitch Fix built a multibillion dollar clothing retail business in the US by combining the best of machines with the best of humans. And WeWork is bringing machine-learned approaches to the physical office environment all around the world. In all cases, algorithmic techniques started simple and slowly became more sophisticated over time. This talk will use these examples to derive an agile approach to machine learning, and will explore that approach across several different dimensions. We will set the stage by outlining the kinds of problems that are most amenable to machine-learned approaches as well as describing some important prerequisites, including investments in data quality, a robust data pipeline, and experimental discipline. Next, we will choose the right (algorithmic) tool for the right job, and suggest how to incrementally evolve the algorithmic approaches we bring to bear. Most fancy cutting-edge recommender systems in the real world, for example, started out with simple rules-based techniques or basic regression. Finally, we will integrate machine learning into the broader product development process, and see how it can help us to accelerate business results.
Travis Cox, Kathy Applebaum, and Kevin McClusky from Inductive Automation will discuss key concepts and best practices, show demos, and answer questions from the audience, to help you start integrating ML into your day-to-day processes.
Learn more about:
• Practical ways to use ML in your factory or facility
• What you'll need to get started
• Existing ML tools and platforms
• And more
Building High Available and Scalable Machine Learning Applications (Yalçın Yenigün)
The slides contain some high-level information about machine learning algorithms, cross-validation and feature extraction techniques, as well as high-level techniques for highly available and scalable ML products.
The proposed system overcomes the above-mentioned issue in an efficient way. It aims at analyzing the number of fraudulent transactions present in the dataset.
An introduction to streaming data, the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
How can you ensure that your work and use of ML gets the most impact in the domain you apply it to? From collaborating with all stakeholders to simulating how predictions will really be used, evaluating them domain-side and deploying models at scale in production, I’ll share some of the lessons I’ve learnt when it comes to integrating ML into real-world applications. Also, I’ll review some research problems and new open source software aimed at making it easier to create, experiment with, and operationalise predictive models.
AWS re:Invent 2016: Leverage the Power of the Crowd To Work with Amazon Mecha... (Amazon Web Services)
With Amazon Mechanical Turk (MTurk), you can leverage the power of the crowd for a host of tasks ranging from image moderation and video transcription to data collection and user testing. You simply build a process that submits tasks to the Mechanical Turk marketplace and gets results quickly, accurately, and at scale. In this session, Russ, from Rainforest QA, shares best practices and lessons learned from his experience using MTurk. The session covers the key concepts of MTurk, getting started as a Requester, and using MTurk via the API. You learn how to set and manage Worker incentives, achieve great Worker quality, and how to integrate and scale your crowdsourced application. By the end of this session, you will have a comprehensive understanding of MTurk and know how to get started harnessing the power of the crowd.
The Fine Art of Combining Capacity Management with Machine Learning (Precisely)
Today, capacity management within the enterprise continues to evolve. In the past, we were focused on the hardware – but now we are focused on the services. With that in mind, the amount of data available has increased significantly and has become difficult for individuals to sort through.
It is apparent that to be successful in this discipline, we need the machines to do more of the heavy lifting. This includes automatically creating reports, calling out anomalies and producing forecasts. Human intuition, however, remains imperative to success.
View this webinar on-demand where we discuss:
• The strengths and weaknesses of capacity management with and without machine learning
• What machine learning can provide throughout the process
• The benefits of using capacity management and machine learning within your organization
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ... (Srinath Perera)
Large scale data processing analyses and makes sense of large amounts of data. Although the field itself is not new, it is finding many use cases under the theme "Big Data", where Google itself, IBM Watson, and Google's driverless car are some of the success stories. Spanning many fields, large scale data processing brings together technologies like distributed systems, machine learning, statistics, and the Internet of Things. It is a multi-billion-dollar industry including use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like smart cities, smart health, and smart agriculture. Some use cases, like urban planning, can be slow and are done in batch mode, while others, like stock markets, need results within milliseconds and are done in streaming fashion. There are different technologies for each case: MapReduce for batch processing, and Complex Event Processing and Stream Processing for real-time use cases. Furthermore, the types of analysis range from basic statistics like the mean to complicated prediction models based on machine learning. In this talk, we will discuss the data processing landscape: concepts, use cases, technologies and open questions, while drawing examples from real world scenarios.
http://icter.org/conference/invited_speeches
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, often operate on the Compressed Sparse Row (CSR) format, an adjacency-list based graph representation.
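As a concrete illustration of the CSR layout mentioned above, here is a minimal sketch (the function and variable names are ours, not from the notes): the out-neighbours of vertex v occupy the slice targets[offsets[v]:offsets[v + 1]].

```python
def to_csr(num_vertices, edges):
    """Build a CSR adjacency representation from an edge list.

    Returns (offsets, targets): the out-neighbours of vertex v are
    targets[offsets[v]:offsets[v + 1]]."""
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    # prefix sum of out-degrees gives the row offsets
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    fill = offsets[:-1].copy()  # next free slot per source vertex
    for u, w in edges:
        targets[fill[u]] = w
        fill[u] += 1
    return offsets, targets

# 4-vertex example: edges 0->1, 0->2, 1->2, 3->0
offsets, targets = to_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
# offsets == [0, 2, 3, 3, 4], targets == [1, 2, 2, 0]
```

The two flat arrays make CSR compact and cache-friendly, which is why it is the usual choice for PageRank-style traversals.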
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
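One of the work-reduction techniques above, skipping vertices that have already converged, can be sketched as follows. This is an illustrative simplification under the stated no-dead-ends assumption (names and thresholds are ours), not the STICD implementation:

```python
def pagerank_skip_converged(graph, damping=0.85, tol=1e-10,
                            skip_tol=1e-7, max_iter=100):
    """Power-iteration PageRank that freezes vertices whose rank
    change drops below skip_tol, so they are skipped afterwards.

    graph maps each vertex to its out-neighbour list; every vertex
    is assumed to have at least one out-link (no dead ends)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    incoming = {v: [] for v in graph}  # transpose: in-neighbours
    for u, outs in graph.items():
        for v in outs:
            incoming[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        delta = 0.0
        for v in graph:
            if v in converged:
                new_rank[v] = rank[v]  # skipped: rank frozen
                continue
            s = sum(rank[u] / len(graph[u]) for u in incoming[v])
            new_rank[v] = (1 - damping) / n + damping * s
            change = abs(new_rank[v] - rank[v])
            if change < skip_tol:
                converged.add(v)
            delta += change
        rank = new_rank
        if delta < tol:
            break
    return rank

# On a 3-cycle every vertex ends up with rank 1/3
ranks = pagerank_skip_converged({0: [1], 1: [2], 2: [0]})
```

Freezing a vertex trades a little accuracy for iteration time: its (slightly stale) rank still feeds its out-neighbours, which is exactly the approximation the technique accepts.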
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
You can leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to advanced persistent threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Thomas Jensen. Machine Learning
1. The Impact of Big Data on
Classic Machine Learning
Algorithms
Thomas Jensen, Senior Business Analyst @ Expedia
2. Who am I?
• Senior Business Analyst @ Expedia
• Working within the competitive intelligence unit
• Responsible for:
• An algorithm that scores new hotels
• An algorithm that predicts room nights sold on existing Expedia hotels
• Scraping competitor sites
• Other stuff…
3. The Promise of Big Data
• Real-time data
• Data-driven decisions
• More accurate and robust models
• Granularity
4. Big Data Challenges
• Data processing – not going to talk about this
• Speed at which to use data – how fast should we update algorithms?
• How do we train algorithms on data sets that do not fit into memory?
6. Classification - Logistic Regression
• One classic task in machine learning / statistics is to classify some objects/events/decisions correctly
• Examples are:
• Customer churn
• Click behavior
• Purchase behavior
• …
• One of the most popular algorithms to carry out these tasks is logistic regression
7. What is logistic regression?
• Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other
• Pr(y | x) = 1 / (1 + e^(−xβ))
• The challenge is to choose the optimal beta(s)
• To do that we minimize a cost function
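The formula above translates directly into code. A minimal sketch (the function name `predict_proba` is illustrative, not a library API):

```python
import math

def predict_proba(x, beta):
    """Logistic regression probability: Pr(y=1 | x) = 1 / (1 + exp(-x.beta)).

    x and beta are plain lists of floats of equal length.
    """
    z = sum(xi * bi for xi, bi in zip(x, beta))  # the linear term x.beta
    return 1.0 / (1.0 + math.exp(-z))
```

With all betas at zero the linear term is zero and the probability is exactly 0.5; large positive or negative x·β pushes it toward 1 or 0.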
8. Why Use Logistic Regression?
• It is a simple and well-understood algorithm
• Outputs probabilities
• There are tried and tested methods to estimate the parameters
• It is flexible – can handle a number of different inputs and feature transformations
9. Usual Approaches
• Batch training (offline approach)
• Get all the data and train the algorithm in one go
• Disadvantages when data is big:
• Requires all data to be loaded into memory
• Periodic retraining is necessary
• Very time consuming with big data!
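To make the memory cost concrete, here is a minimal sketch of batch training for logistic regression: every iteration recomputes the gradient over all rows, so the whole dataset must sit in memory for the entire run. Function names and hyperparameters are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_train(X, y, lr=0.5, n_iter=200):
    """Full-batch gradient descent on the logistic log-loss.

    Every iteration touches ALL rows of X, which is why batch
    training needs the full dataset in memory.
    """
    n_features = len(X[0])
    m = len(X)
    beta = [0.0] * n_features
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):  # full pass over the data
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(n_features):
                grad[j] += (p - yi) * xi[j]
        # One weight update per full pass.
        beta = [b - lr * g / m for b, g in zip(beta, grad)]
    return beta
```

On a tiny separable toy set (bias column plus one feature) the learned betas put the decision boundary between the two classes.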
11. Examples of Logistic Regression in Industry Settings – Real-Time Bidding
• RTB algorithms are usually based on logistic regression
• Whether or not to bid on a user is determined by the probability that the user will click on an ad
• Each day billions of bids are processed
• Each bid has to be processed within 80 milliseconds
12. Examples of Logistic Regression in Industry Settings – Fraud Detection
Detecting fraudulent credit card transactions:
• The probability that a transaction is using a stolen credit card is typically estimated with logistic regression
• Billions of transactions are analyzed each day
13. How Slow is the Batch Version of Logistic Regression?
One target variable and two feature vectors, all randomly generated.
15. A Real World Problem
• Some stats on the training job in the pipeline:
• Runs training jobs on a per-country basis
• Longest running job lasts ~9 hours
• Shortest running job lasts ~3 hours
• There are often convergence failures
• What we need is an algorithm that:
• Can reduce training time
• Is robust to convergence failures
16. A Big Data Friendly Approach
Online Training
• Pass each data point sequentially through the algorithm
• Only requires one data point at a time in memory
• Allows for on-the-fly training of the algorithm
17. Online Learning
• We want to learn a vector of weights
• Initialize all weights. Begin loop:
1. Get training example
2. Make a prediction for the target variable
3. Learn the true value of the target
4. Update the weights and go to 1
18. Online Learning
• Initialise all weights. Begin loop:
Repeat {
  For i = 1 to m {
    θ_j = θ_j − α · ∂/∂θ_j cost(θ, (x_i, y_i))
  }
}
• α is the step size – how fast we should step along the gradient
• ∂/∂θ_j is the partial derivative of the cost function
• cost(θ, (x_i, y_i)) is the cost function – given theta and row i, i.e. how wrong are we?
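The update rule above can be sketched as a one-point-at-a-time loop; for the logistic log-loss the partial derivative works out to (p − y)·x_j. This is an illustrative sketch, not any particular library's implementation:

```python
import math

def sgd_update(beta, x, y, alpha=0.1):
    """One online step: theta_j = theta_j - alpha * d/dtheta_j cost(theta, (x, y)).

    For log-loss the partial derivative is (p - y) * x_j,
    where p is the current predicted probability.
    """
    p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
    return [b - alpha * (p - y) * v for b, v in zip(beta, x)]

def online_train(stream, n_features, alpha=0.1):
    """Only one (x, y) pair is ever in memory at a time."""
    beta = [0.0] * n_features
    for x, y in stream:
        beta = sgd_update(beta, x, y, alpha)
    return beta
```

Feeding the same small labeled stream through repeatedly is enough to learn a separating weight vector; `stream` can equally be a generator reading rows off disk or a message queue.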
19. Online Learning
• Approaches the optimum of the function in a jumpy manner and never actually settles on it.
20. Batch vs. Online Learning
Data: 4.8GB, 500,000 rows, 5,000 columns
[Bar chart: training time for Batch, SGDClassifier, and Sofia-ml; y-axis 0–120]
*Times include reading data and training the algorithm
21. Online Learning vs. Batch
Online learning:
• When we have a continuous stream of data
• When it is important to update the algorithm in real time – can hit a moving target
• When training speed is important
• Parameters are "jumpy" around the optimal values
Batch:
• When it is very important to get the exact optimal values
• When data can fit in memory
• When training time is not of the essence
22. Popular Online Learning Libraries
• Sofia-ml (C/C++)
• Requires data in SVMlight format
• Has implementations of SVMs, neural networks and logistic regression
• Supports classification and ranking
• Vowpal Wabbit (C/C++)
• Requires data in its own vw format
• Has implementations of the most popular loss functions
• Supports classification, ranking and regression
• Pandas + scikit-learn (Python)
• Pandas has a nice function for reading files in batches
• Can handle sparse and non-sparse matrices
• Scikit-learn has an SGD classifier that can fit the model in batches
• Supports classification, ranking and regression
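The pandas + scikit-learn pattern mentioned above (read the file in chunks, update the model on each chunk, discard the chunk) can be imitated with nothing but the standard library. The sketch below uses the csv module and a hand-rolled SGD step; all names are illustrative and this is not the scikit-learn API:

```python
import csv
import io
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_fit(beta, chunk, alpha):
    """One SGD pass over a single in-memory chunk of (x, y) rows."""
    for x, y in chunk:
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        beta = [b - alpha * (p - y) * v for b, v in zip(beta, x)]
    return beta

def train_in_chunks(csv_file, n_features, chunk_size=2, alpha=0.1, epochs=50):
    """Mimics the pandas-chunks + partial_fit pattern: only chunk_size
    rows are ever in memory, no matter how large the file is.

    csv_file is a seekable file object whose rows are features
    followed by the 0/1 label in the last column.
    """
    beta = [0.0] * n_features
    for _ in range(epochs):
        csv_file.seek(0)  # re-read the file each epoch
        chunk = []
        for row in csv.reader(csv_file):
            chunk.append(([float(v) for v in row[:-1]], float(row[-1])))
            if len(chunk) == chunk_size:
                beta = partial_fit(beta, chunk, alpha)
                chunk = []
        if chunk:  # flush the final, possibly short, chunk
            beta = partial_fit(beta, chunk, alpha)
    return beta
```

Usage is the same whether the file object is an open CSV on disk or, as below, an in-memory `io.StringIO`; swapping in `pandas.read_csv(..., chunksize=...)` and `SGDClassifier.partial_fit` gives the scalable version of the same loop.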