Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
ESIP 2018 - The Case for Archives of Convenience - Dan Pilone
Earth Science data is measured in petabytes and represents decades of data collection, evolution of technology and practices, and provides an unparalleled view of our planet. The pace of change is only accelerating: NASA and other agencies are on their way to making hundreds of Petabytes of data available in the cloud, highly scalable processing and analysis architectures and tools are in active use with more being developed every day, and each of these brings with it opportunities for optimization and innovation. This talk demonstrates leveraging the elastic nature of the cloud using GOES-16 data to create ephemeral Archives of Convenience, targeting individual researcher needs, optimized for their problems and tool suites, instead of trying to settle on a single "cloud optimized" solution.
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW... - Gary Stafford
We will read and write messages to and from Amazon MSK in Apache Avro format. We will store the Avro-format Kafka message’s key and value schemas in Apicurio Registry and retrieve the schemas instead of hard-coding the schemas in the PySpark scripts. We will also use the registry to store schemas for CSV-format data files.
Link to the blog post and video: https://itnext.io/stream-processing-with-apache-spark-kafka-avro-and-apicurio-registry-on-amazon-emr-and-amazon-13080defa3be
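The registry-lookup pattern described above can be sketched in plain Python. The registry contents, the artifact IDs (`topic1-key`, `topic1-value`), and the `Sensor` record are illustrative stand-ins; a real job would fetch schemas over Apicurio Registry's REST API rather than from an in-memory dict.

```python
import json

# Minimal stand-in for a schema registry: artifact ID -> Avro schema JSON.
# Names and schemas below are hypothetical, for illustration only.
REGISTRY = {
    "topic1-key": json.dumps({"type": "string"}),
    "topic1-value": json.dumps({
        "type": "record", "name": "Sensor",
        "fields": [{"name": "id", "type": "string"},
                   {"name": "temp", "type": "double"}],
    }),
}

def fetch_schema(artifact_id):
    """Return the parsed Avro schema for an artifact, the way the PySpark
    job would look it up instead of hard-coding the schema inline."""
    return json.loads(REGISTRY[artifact_id])

value_schema = fetch_schema("topic1-value")
field_names = [f["name"] for f in value_schema["fields"]]
```

The point is that the script only needs the artifact ID; the schema itself lives in one governed place and can evolve without code changes.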
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R - Databricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
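As a hedged sketch of the first bottleneck named above: hyperparameter tuning is embarrassingly parallel, so each grid point can become an independent task (with Spark, one task per parameter combination). The toy objective `train_and_score` below is a hypothetical stand-in for a real train-and-validate cycle, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(reg, depth):
    # Hypothetical objective standing in for training a model with the
    # given regularization and depth, then scoring it on validation data.
    return -((reg - 0.1) ** 2 + (depth - 3) ** 2)

# Evaluate every grid point concurrently; each point is independent,
# which is exactly why this stage distributes so easily.
grid = list(product([0.01, 0.1, 1.0], [2, 3, 4]))
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda p: train_and_score(*p), grid))

best = grid[max(range(len(grid)), key=scores.__getitem__)]  # (0.1, 3)
```

Swapping the thread pool for Spark's parallelism changes the executor, not the structure of the search.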
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16 - MLconf
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads to do log processing, as well as training machine-learned models to use during realtime querying, and are highly domain specific, but many techniques are fairly universal: we will discuss how Spark can interface with a Solr Cloud cluster to efficiently perform many of the pieces to this puzzle in one relatively self-contained package (no HDFS/S3, all data stored in Solr!), and introduce “spark-solr” – an open-source JVM library to facilitate this.
H2O Rains with Databricks Cloud - NY 02.16.16 - Sri Ambati
Michal Malohlava's presentation on H2O Rains with Databricks Cloud, New York, NY 02.16.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 - MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
SAREF in the InterConnect project - ICTOpen 2022 - RonaldSiebes2
Presenting Machine Learning research using SAREF in the InterConnect project at the "Smart Cities, Health and AI " track at ICTOpen 2022 (https://www.ictopen.nl/tracks/track-smart-cities-health-and-ai/)
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries... - Databricks
Data production continues to scale up and the techniques for managing it need to scale too. Building pipelines that can process petabytes per day in turn create data lakes with exabytes of historical data. At Databricks, we help our customers turn these data lakes into gold mines of valuable information using Apache Spark. This talk will cover techniques to optimize access to these data lakes using Delta Lakes, including range partitioning, file-based data skipping, multi-dimensional clustering, and read-optimized files. We'll cover sample implementations and see examples of querying petabytes of data in seconds, not hours. We'll also discuss tradeoffs that data engineers deal with every day like read speed vs. write throughput, managing storage costs, and duplicating data to support multiple query profiles. We'll also discuss combining batch with streaming to achieve desired query performance. After this session, you will have new ideas for managing truly massive Delta Lakes.
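File-based data skipping, one of the techniques listed above, can be illustrated with a small sketch: each file carries min/max statistics for a column (as Delta Lake records in its transaction log), and the planner scans only files whose range overlaps the query predicate. The file names and date ranges below are made up for illustration.

```python
# Hypothetical per-file statistics, as a planner would read them from the log.
FILES = [
    {"path": "part-0.parquet", "min_date": "2019-01-01", "max_date": "2019-03-31"},
    {"path": "part-1.parquet", "min_date": "2019-04-01", "max_date": "2019-06-30"},
    {"path": "part-2.parquet", "min_date": "2019-07-01", "max_date": "2019-09-30"},
]

def files_to_scan(lo, hi, files=FILES):
    """Keep only files whose [min, max] range overlaps the query range;
    ISO dates compare correctly as strings."""
    return [f["path"] for f in files
            if f["max_date"] >= lo and f["min_date"] <= hi]

# A query for May 2019 touches one file instead of three.
to_scan = files_to_scan("2019-05-01", "2019-05-31")
```

The same interval-overlap test, applied over thousands of files, is what turns a full-table scan into a handful of reads.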
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict... - Databricks
The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waiting times for patients awaiting their procedures. In this three-part session, Ayad Shammout and Denny will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark and easily transitioned to Spark MLlib, so we could more easily predict available OR block times, resulting in better OR utilization and shorter wait times for patients.
3) Some of the key learnings we had when migrating from DW to Spark.
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure, utilizing Databricks and following DevOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The presenting team is currently responsible for ensuring that best practices are implemented on all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability, and version control.
Data science in Ruby: is it possible? is it fast? should we use it? - Rodrigo Urubatan
Slides used in my presentation at http://thedevelopersconference.com.br in the #ruby track this year in São Paulo, talking a little about data science, the alternatives for doing it in Ruby, how to integrate Ruby and Python, and the best solutions available.
A MAC URISA event. This talk is oriented to GIS users looking to learn more about the Python programming language. The Python language is incorporated into many GIS applications. Python also has a considerable installation base, with many freely available modules that help developers extend their software to do more.
The beginning third of the talk discusses the history and syntax of the language, along with why a GIS specialist would want to learn how to use the language. The middle of the talk discusses how Python is integrated with the ESRI ArcGIS Desktop suite. The final portion of the talk discusses two Python projects and how they can be used to extend your GIS capabilities and improve efficiency.
Recording of the talk: https://www.youtube.com/watch?v=F1_FqvbXHb4
The talk covers the Cortana Analytics Suite ecosystem, including the Azure Machine Learning predictive analytics service. The demo portion walks through sentiment analysis of social network messages.
Video of the talk and notes on the demo are available at http://0xcode.in/dev-camp
DF1 - ML - Petukhov - Azure ML: Machine Learning as a Service - MoscowDataFest
Presentation from Moscow Data Fest #1, September 12.
Moscow Data Fest is a free one-day event that brings together Data Scientists for sessions on both theory and practice.
Link: http://www.meetup.com/Moscow-Data-Fest/
Microsoft & Machine Learning / Artificial Intelligence - İbrahim KIVANÇ
This presentation covers Machine Learning and Deep Learning tools and services from Microsoft, including Azure Machine Learning Workbench, Azure Notebooks, Azure Data Science Virtual Machines, and more.
Here are the demos & resources
https://github.com/ikivanc/Azure-ML-Workbench-Iris-Dataset-Classification
https://github.com/ikivanc/Azure-ML-Resources
Join Joseph Sirosh, Corporate Vice President of the Cloud AI Platform, for a deep dive into the AI platform and exciting AI use cases. Joseph will showcase how every developer can infuse intelligence into their applications and create amazing new experiences with AI. In this exciting overview, you will learn about the application of AI technologies in the cloud. We will help you understand how to add pre-built AI capabilities like object detection, face understanding, translation and speech to applications. We will show how developers can build Cognitive Search applications that understand deep content in images, text and other data. We will also show how the platform can be used to build your own custom AI models for predictive applications and how to use the Azure platform to accelerate machine learning. Joseph will also show how companies assemble end-to-end systems of intelligence using the rich variety of data and application development services on Azure.
rosettaHUB federates the infrastructures and tools of data science within a highly interactive, responsive, and programmable framework. It provides a new generation of virtual real-time collaborative and provenance-aware workbenches, notebooks, and scientific spreadsheets, as well as tools for building Python, R, Julia, Scala, and SQL-based interactive/collaborative web applications.
Build Low-Latency Applications in Rust on ScyllaDB - ScyllaDB
Join us for a developer workshop where we’ll go hands-on to explore the affinities between Rust, the Tokio framework, and ScyllaDB.
ScyllaDB is a perfect match for Rust. Similar to the Rust programming language and the Tokio framework, ScyllaDB is built on an asynchronous, non-blocking runtime that works extremely well for building highly-reliable low-latency distributed applications.
In this workshop, you’ll go live with our sample Rust application, built on our new, high performance native Rust client driver. By compiling and walking through the code, you’ll learn specifically how to craft queries to a locally running ScyllaDB cluster.
In the process you’ll discover the features and best practices that enable your Rust applications to squeeze maximum performance out of ScyllaDB's shard-per-core architecture.
- Install and compile an IoT sample app, built on ScyllaDB’s native Rust SDK.
- Install a single cluster of Scylla locally
- Use Docker to get a 3-node cluster running on your laptop
- Connect the application to the database
- Review data modeling, query types and best practices
- Manage and monitor
If you’re an application developer with an interest in Rust and Tokio, this workshop is for you!
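A rough sketch of why shard-awareness matters for the shard-per-core architecture mentioned above: the partition key maps to a token, and the token maps to an owning shard, so a driver that computes this mapping client-side (as the ScyllaDB Rust driver does) can send each query straight to the right core. The MD5-based token function and shard count below are illustrative assumptions, not ScyllaDB's actual murmur3 partitioner.

```python
import hashlib

def token(partition_key: str) -> int:
    # Stand-in token function; real ScyllaDB uses a murmur3-based partitioner.
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def owning_shard(partition_key: str, shard_count: int = 8) -> int:
    # Deterministic key -> shard mapping lets a driver skip a cross-core hop.
    return token(partition_key) % shard_count
```

Because the mapping is deterministic, every client computes the same shard for the same key, which is what lets a shard-aware driver bypass inter-core forwarding on the server.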
Neuron is a serverless Deep Learning and AI experiment platform for analytics where you can build, deploy, and visualise data models.
Practical lab on cloud access from anywhere.
Jupyter Notebooks and Apache Spark are first-class citizens of the data science space, a true requirement for the "modern" data scientist. Now, with Azure Synapse, these two computing powers are available to the .NET developer, and .NET is available for all data scientists. Let's look at what .NET can do for notebooks and Spark inside Azure Synapse, and at what Synapse, notebooks, and Spark are.
This is the first version of the key products we have used to offer services to our clients: about 30 tools, mostly open source, that are being used at our startup to develop minimum viable products.
Tour de France Azure PaaS 6/7: Adding intelligence - Alex Danvy
We will probably witness a generational break between apps with artificial intelligence and those without. The latter, like character-mode applications at the arrival of graphical interfaces, will struggle to survive.
Azure offers three approaches for adding AI to an app, with graduated levels of difficulty, from tools requiring no particular skill to those dedicated to data scientists.
AnalyticsConf2016 - Advanced analytics on the Azure HDInsight platform - Łukasz Grala
A session on Microsoft's Big Data Analytics solution: Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server, and advanced analytics using RevoScaleR.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Microsoft Technologies for Data Science 201612Mark Tabladillo
Delivered to SQL Saturday BI Edition -- Atlanta, GA
Microsoft provides several technologies in and around Azure which can be used for casual to serious data science. This presentation provides an overview of the major Microsoft options for on-premises, cloud-based, and hybrid data science. These technologies have been used by the presenter in various companies and industries, both as a Microsoft consultant and previously as an independent consultant. The speaker also provides insights into data science careers, information which helps indicate where the business will likely be for consultants and partners.
Microsoft Power BI and Cortana Analytics user group meetings with Alteryx - Håkan Söderbom
Introducing integration between Alteryx Designer and Microsoft Cortana analytics suite, including Power BI, Azure Machine Learning, SQL Server, SQL Data Warehouse, Microsoft R Server, MRS (Revolution).
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... - Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Microsoft is working hard to make Artificial Intelligence available to everyone. We not only infuse AI into our products but also give you the platform to build your very own solution, whether you are a developer, a citizen data scientist, or a hardcore data scientist.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) assets, while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [STICD]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
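A minimal sketch of the first technique above, skipping vertices whose rank has already converged. Note this is a heuristic: a vertex marked converged is never revisited even if its in-neighbors later move, which is the approximation the optimization accepts. The three-vertex cycle graph at the end is illustrative.

```python
def pagerank_skip_converged(out_links, damping=0.85, tol=1e-6, max_iter=100):
    """PageRank that freezes a vertex once its per-iteration change drops
    below tol, skipping it in all later iterations."""
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in out_links:
            if v in converged:
                new_rank[v] = rank[v]  # skipped: no recomputation
                continue
            flow = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * flow
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

# Symmetric 3-cycle: every vertex settles at rank 1/3 immediately.
ranks = pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]})
```

As more of the graph freezes, each iteration touches fewer vertices, which is exactly the per-iteration saving described above.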
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, the maintainers of Milvus.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be computed in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, require that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by component. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of many small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
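The levelwise scheme described in the abstract can be sketched as follows. The strongly connected components are assumed to be supplied already in topological order of the condensation (e.g. from Tarjan's algorithm), and the graph is assumed to have no dead ends; the function name is illustrative:

```python
def levelwise_pagerank(adj, components, d=0.85, tol=1e-12):
    """Process SCCs in topological order. Edges only run from earlier
    components to later ones, so each component's in-links from outside
    are already final and it can be iterated to convergence on its own.
    adj[u] lists out-neighbors of u; components is a topo-ordered SCC list."""
    n = len(adj)
    outdeg = [len(nbrs) for nbrs in adj]
    radj = [[] for _ in range(n)]              # reverse adjacency (in-links)
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            radj[v].append(u)
    r = [1.0 / n] * n
    for comp in components:
        while True:                            # local power iteration
            new = {v: (1 - d) / n + d * sum(r[u] / outdeg[u] for u in radj[v])
                   for v in comp}
            err = max(abs(new[v] - r[v]) for v in comp)
            for v in comp:
                r[v] = new[v]
            if err < tol:
                break
    return r
```

Components on the same level of the condensation share no edges, so they could be iterated concurrently; that independence is the distribution opportunity the abstract describes.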
1. Microsoft Azure ♥ R
Data Science with Microsoft Azure and R
Dmitry Petukhov,
Microsoft Data Platform MVP, C# MCP,
Big Data Enthusiast && Coffee Addicted
2. Microsoft Azure + R. Prototype to Product Challenge
Prototyping: Flexibility, Distributed, Scalable, Fault-tolerance, Reliable
Production: Flexibility, Distributed, Scalable, Fault-tolerance, Reliable, + Big Data Ready, + LSML
Migration from prototype to production: Black Magic!
3. Microsoft Azure + R. Hello R!
Python is a COOL language!
But R…
Specialized in statistical analysis
Time-effective => ideal for prototyping, competitions, research, and fun!
Standalone computing => reasonably scalable
Open source
Big bearded community
4. Microsoft Azure + R. Infrastructures for Data Scientists
Machine Learning in Finance: infrastructure stacks for a data scientist, arranged by cost of deployment/ownership from low to high (Local PC -> Hybrid Model -> Cluster, on-premises/on-demand -> ML as a Service):
Local PC: Local Disc (storage), Local OS (resource management), Python runtime / yet another runtime (execution engine), scikit-learn / some library (ML framework)
Hadoop cluster: HDFS (storage), YARN (resource management), MapReduce (execution engine), Mahout (ML framework)
Spark cluster: HDFS / S3 (storage), YARN / Apache Mesos (resource management), Spark (execution engine), MLlib (ML framework)
Python/R on Spark: HDFS / S3 (storage), YARN / Apache Mesos (resource management), Spark (execution engine), Python/R tools (ML framework)
ML as a Service: distributed FS (storage), dark magic inside, Python/R tools (ML framework)
5. Microsoft Azure + R. Microsoft ♥ R
R Server for Azure HDInsight
Data Science VM
Azure Machine Learning
Supports R-script execution
Allows authoring custom R modules
Jupyter Notebooks with R kernel support
Azure HDInsight
Hadoop/Spark-cluster as a Service
SQL Server R Services
Power BI
Running R Scripts & excellent visualization
R Tools for Visual Studio
8. R Server for Azure HDInsight
Killer features list:
100% open source R implementation;
workload running inside HDInsight (Hadoop/Spark).
Microsoft Azure + R. R Server for Azure HDInsight
9. R, Python, SQL, C#
Microsoft Azure + R. Data Science VM
Data Science VM inside:
Microsoft R Server Developer Edition,
Anaconda Python distribution,
Jupyter notebooks for Python and R,
Visual Studio Community Edition with Python and R Tools,
Power BI desktop,
SQL Server Express edition,
ML libs: CNTK, xgboost and Vowpal Wabbit,
Azure SDK
10. R Tools in Azure Machine Learning:
Supports R-script execution;
Allows authoring custom R modules;
Jupyter Notebooks with R kernel support.
Microsoft Azure + R. Azure Machine Learning
11. Microsoft Azure + R. Azure Machine Learning
[Diagram: Jupyter Notebook, Azure ML Studio, and GitHub/TFS in Azure connected by command, data, and request/response flows, exchanging the trained model h(θ0, θn).]
12. References
Cortana Intelligence and Machine Learning Blog
R for Azure Machine Learning. Quickstart
Machine Learning Algorithm Cheat Sheet
Machine Learning Hackathon. How to win?
Azure ML Repositories on GitHub
Microsoft Azure for all group on Facebook
Soon in Slack (invite form)
Microsoft Azure + R. References
14. Q&A
Now or later (send to d.petukhov@outlook.com)
Ping me
Habr: @codezombie
LinkedIn: @dpetukhov
Facebook: @code.zombi
Read my tech code instinct blog ( http://0xCode.in/ )
Microsoft Azure + R. Stay in Touch!
Editor's Notes
Revolution Analytics
Revolution R Open and Revolution R Enterprise
Revolution R is a runtime environment for the R language (a programming language for statistical data processing and graphics), optimized for multithreaded computation, together with a set of libraries for parallel processing under the "big data" paradigm.
R Server for Azure HDInsight is a 100% open source R implementation running the most comprehensive set of ML algorithms and statistical functions in the cloud, leveraging Hadoop and Spark.
By making R Server available as a workload running inside HDInsight, we remove obstacles to unlocking the power of R: memory and processing constraints are eliminated, and analytics extend from the laptop to large multi-node Hadoop and Spark clusters. This makes it possible to train and run ML models on larger datasets than previously possible, yielding more accurate predictions that affect the business. It also reduces the time to move ideas into production by eliminating time-consuming installation, setup, and hardware procurement cycles.