Federated learning makes it possible to build machine learning systems without direct access to the training data. The data remains in its original location, which helps ensure privacy, reduces network communication costs, and taps the computing resources of edge devices. The data-minimization principles established by the GDPR and the growing prevalence of smart sensors make the advantages of federated learning increasingly compelling. Federated learning is a great fit for smartphones, industrial and consumer IoT, and healthcare and other privacy-sensitive use cases.
We’ll present the Fast Forward Labs team’s research on this topic and the accompanying prototype application, “Turbofan Tycoon”: a simplified working example of federated learning applied to a predictive maintenance problem. In this demo scenario, customers of an industrial turbofan manufacturer are not willing to share the details of how their components failed with the manufacturer, but they do want the manufacturer to provide them with a strategy for maintaining the part. Federated learning lets us satisfy the customers’ privacy concerns while providing them with a model that leads to fewer costly failures and less maintenance downtime.
We’ll discuss the advantages and tradeoffs of taking the federated approach. We’ll assess the state of tooling for federated learning, circumstances in which you might want to consider applying it, and the challenges you’d face along the way.
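To make the core idea concrete, here is a minimal federated averaging sketch in Python. It is not the Turbofan Tycoon code; it assumes NumPy and purely synthetic per-client data. Each simulated client fits a local linear model on its own data, and only the learned weights and sample counts leave the client, never the raw records.

import numpy as np

def local_update(X, y, w, lr=0.1, epochs=5):
    """One client's local training: a few epochs of gradient descent on MSE."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w = w - lr * grad
    return w

def federated_round(clients, w_global):
    """One round: clients train locally, the server averages by sample count."""
    updates, counts = [], []
    for X, y in clients:
        updates.append(local_update(X, y, w_global.copy()))
        counts.append(len(y))
    return np.average(updates, axis=0, weights=np.array(counts, dtype=float))

# Hypothetical client datasets (standing in for per-customer sensor readings).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(clients, w)
print("federated estimate:", w)  # approaches [2, -1] without pooling the raw data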
Speaker
Chris Wallace
Data Scientist
Cloudera
This presentation walks through fitting a simple linear regression model in a Python program. The IDE used is Spyder, and screenshots from a working example are used for demonstration.
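The gist of such a script, as a minimal sketch with hypothetical single-feature data (the presentation's own dataset is not reproduced here):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: years of experience vs. salary (in thousands).
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([35, 42, 50, 55, 63, 68, 77, 84], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2 on test data:", model.score(X_test, y_test))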
Scikit-Learn is a powerful machine learning library implemented in Python on top of the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib, enabling extremely fast analysis of small- to medium-sized data sets. It is open source, commercially usable, and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a data scientist's toolkit for machine learning on incoming data sets.
The purpose of this one-day course is to serve as an introduction to machine learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement them on our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms, rather than as simply a research or investigation methodology.
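As one small illustration of that "data product" style (a sketch only; the course's actual datasets and exercises are not shown here), a scikit-learn Pipeline bundles preprocessing and an estimator into a single reusable object, here clustering the bundled iris measurements:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, _ = load_iris(return_X_y=True)

# Scaling and clustering travel together, so the fitted pipeline can be
# dropped into a larger program as a single artifact.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = pipeline.fit_predict(X)
print(labels[:10])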
The Naive Bayes classifier is a machine learning technique that is exceedingly useful for addressing several classification problems. It is often used as a baseline classifier to benchmark results. It is also used as a standalone classifier for tasks such as spam filtering, where the naive assumption (conditional independence) made by the classifier seems reasonable. In this presentation we discuss the mathematical basis for Naive Bayes and illustrate it with examples.
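A tiny spam-filtering-style sketch of the idea (the texts and labels below are made up for illustration, not from the presentation): word counts are the features, and the "naive" assumption is that word occurrences are conditionally independent given the class.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win money now", "limited offer click here", "cheap pills online",
    "meeting at noon", "project status update", "lunch with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["click here to win money", "status of the project"]))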
Data visualization is widely used in industry for infographics design, business analytics, data analytics, advanced analytics, business intelligence dashboards, and content marketing. This is the first part of a three-part series on data visualization. These techniques will enable you to create a good UI/UX design. It contains R code useful for programmers who want to create good visual charts and tell a story to clients, customers, senior management, and others.
Developing R Graphical User Interfaces, presented at:
1. Workshop on Development of R software for data analysis, Hasselt University, Belgium, March 13th, 2013.
2. Joint Seminar, Medical Epidemiology and Biostatistics Department, Karolinska Institutet, April 4th, 2013.
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM (lucenerevolution)
In this session we will show how to build a text classifier using Apache Lucene/Solr together with the libSVM library. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one, or more categories. Known machine learning techniques for text classification include the naive Bayes model, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vectors. We then use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, and then use the Hadoop MapReduce framework to reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
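For readers unfamiliar with the one-vs-all construction, here is an analogous sketch in scikit-learn, assuming TF-IDF features and linear SVMs; it is not the Lucene/Solr + Hadoop pipeline from the talk, and the job-offer snippets and categories are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "java developer spring microservices",
    "nurse hospital patient care",
    "accountant ledger tax reporting",
    "backend engineer kubernetes golang",
]
categories = ["engineering", "healthcare", "finance", "engineering"]

# One linear SVM is trained per category, exactly the one-vs-all scheme
# described above, just with scikit-learn standing in for Solr + libSVM.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(docs, categories)
print(clf.predict(["senior python engineer"]))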
Analysis of Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na... (Spark Summit)
The Andromeda Galaxy, or M31, is a spiral galaxy approximately 2.5 million light years away from the Milky Way. As the nearest large external galaxy, it allows us to study galaxy features not visible in our own Milky Way due to our position within the galaxy. Recent studies have shown that the disc and halo of the Andromeda Galaxy extend further than previously thought. Rafiei Ravandi et al. (2016) extended previous surveys of Andromeda at mid-infrared wavelengths to produce a catalog containing 426,529 objects. We have used the Apache Spark API for Python to cross-correlate these objects with previous astronomical catalogs, such as SIMBAD, NED, and MAST (over 11 million objects). The aim is to determine whether the objects from the new survey are all part of the M31 galaxy or belong to the background or foreground. The Spark-Python code makes full use of Spark RDDs to join multiple catalogs into a single table; this helps us to predict whether a particular object is in fact part of Andromeda.
We used key-value pairs to reduce duplicate data from the MAST catalog, and using groupByKey we can classify a particular astronomical object using previous catalogs. We conclude that our new tool can help us to better understand multiple astronomical catalogs for the Andromeda galaxy, including resolution differences between catalogs and the regions of the galaxy where astronomical objects (such as X-ray binaries or black holes) dwell.
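The key-value/groupByKey pattern mentioned above looks roughly like the following PySpark sketch. This is not the authors' code; the object identifiers and catalog names are hypothetical stand-ins for the M31 survey, SIMBAD, NED, and MAST entries.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-crossmatch-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical (object_id, catalog_name) records.
records = [
    ("obj-001", "M31-survey"), ("obj-001", "SIMBAD"),
    ("obj-002", "M31-survey"), ("obj-003", "MAST"),
    ("obj-001", "MAST"), ("obj-002", "NED"),
]
rdd = sc.parallelize(records)

# distinct() drops duplicate (id, catalog) pairs, then groupByKey collects
# every catalog in which each object appears.
matches = rdd.distinct().groupByKey().mapValues(list)
for obj_id, catalogs in matches.collect():
    print(obj_id, sorted(catalogs))

spark.stop()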
Spark + Flashblade: Spark Summit East talk by Brian Gold (Spark Summit)
Modern infrastructure and applications generate extraordinary volumes of log and telemetry data. At Pure Storage, we know this first hand: we have over 5PB of log data from production customers running our all-flash storage systems, from our engineering testbeds, and from test stations at manufacturing partners. Every part of our company — from engineering to sales — now depends on the insights we gather from this data. Given the diversity of our end users, it’s no surprise that our analysis tools comprise a broad mix of reporting queries, stream-processing operations, ad-hoc analyses, and deeper machine-learning algorithms. In this session, we will cover lessons learned from scaling our data warehouse and how we are leveraging Apache Spark’s capabilities as a central hub to meet our analytics demands.
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea... (Spark Summit)
Water Mission is a non-profit organization dedicated to bringing safe water solutions where needed around the world. The IBM jStart team is helping Water Mission to analyze water consumption data from Water Mission’s treatment stations in Uganda and Tanzania. The Bluemix Apache Spark service is used to combine and correlate this data with socio-economic and weather data, look for behavioral patterns, predict water shortages, and suggest changes to the way the stations are operated in order to increase delivery of safe drinking water within the local communities.
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics... (Spark Summit)
Cybercrime is big business. Gartner reports worldwide security spending at $80B, with annual losses totaling more than $1.2T (in 2015). Small- to medium-sized businesses now account for more than half of the attacks targeting enterprises today. The threat actors behind these attacks continually shift their techniques and toolkits to evade the security defenses that businesses commonly use. Thanks to the growing frequency and complexity of attacks, the task of identifying and mitigating security-related events has become increasingly difficult.
At eSentire, we use a combination of data and human analytics to identify, respond to, and mitigate cyber threats in real time. We capture all network traffic on our customers’ networks, ingesting a large amount of time-series data. We process the data as it streams into our system to extract relevant threat insights and block attacks in real time. Furthermore, we enable our cybersecurity analysts to perform in-depth investigations to (i) confirm attacks and (ii) identify threats that analytical models miss. Having security experts in the loop provides feedback to our analytics engine, thereby improving overall threat detection effectiveness.
So how exactly can you build an analytics pipeline to handle a large amount of time-series/event-driven data? How do you build the tools that allow people to query this data with the expectation of mission-critical response times?
In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the-loop cyber threat analytics pipeline. He will discuss analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka, and Alluxio in creating an analytics architecture with mission-critical response times.
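As a generic illustration of the streaming-ingest part of such an architecture (a sketch only, not eSentire's pipeline), the following Structured Streaming job reads events from a Kafka topic and counts them per minute. The broker address and topic name are assumptions, and running it requires the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("threat-event-counts-sketch").getOrCreate()

# Read a stream of network events from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
    .option("subscribe", "network-events")             # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS event", "timestamp")
)

# Count events per one-minute window as a stand-in for real threat analytics.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()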
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan... (Spark Summit)
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla (Spark Summit)
Spark has deservedly become the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies, so their combination is one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, with any of its cluster managers, without degrading Spark's outstanding performance.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea... (Spark Summit)
Everybody agrees that IoT is changing the world… and creating new challenges for software developers, architects, and DevOps. How can we build efficient and highly scalable distributed applications using open-source technologies? What are the characteristics of data generated by IoT devices, and how do they differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open-source technologies, like Apache Spark, Riak, and Mesos, to build scalable IoT pipelines that ingest, store, and analyze huge amounts of data, while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL, and object storage products for storing and archiving IoT data. Then we will cover best practices for using Spark with the Riak NoSQL database, and describe how Apache Spark's advanced modules (Spark SQL, Spark Streaming, and MLlib) can solve problems common to IoT apps, while using Riak for fast and scalable persistence. At the end, we will explain why Structured Spark Streaming is a godsend for IoT data and make a case for time series databases deserving a separate category in the NoSQL classification.
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk... (Spark Summit)
Scaling out doesn’t have to mean giving up transactions and efficient joins! Relational databases can scale horizontally, and using them as a store for Spark Streaming or batch computations can help cover areas in which Spark is typically weaker. Examples will be drawn from our experience using Citus (https://github.com/citusdata/citus), an open-source extension to Postgres, but lessons learned should be applicable to many databases.
Improving Python and Spark Performance and Interoperability: Spark Summit Eas... (Spark Summit)
Apache Spark has become a popular and successful way for Python programming to parallelize and scale up their data processing. In many use cases, though, a PySpark job can perform worse than equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master. In this talk, we’ll examine some of the the data serialization and other interoperability issues, especially with Python libraries like pandas and NumPy, that are impacting PySpark performance and work that is being done to address them. This will relate closely to other work in binary columnar serialization and data exchange tools in development such as Apache Arrow and Feather files.
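For context on what that Arrow work looks like from user code, here is a small sketch. It assumes a later Spark release where the Arrow integration has landed and is controlled by the spark.sql.execution.arrow.enabled flag (renamed to spark.sql.execution.arrow.pyspark.enabled in Spark 3.x); the data is synthetic.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("arrow-interop-sketch").getOrCreate()

# Enable Arrow-based columnar transfer between the JVM and Python.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})
sdf = spark.createDataFrame(pdf)   # pandas -> Spark, via Arrow when enabled
roundtrip = sdf.toPandas()         # Spark -> pandas, also Arrow-accelerated
print(roundtrip.head())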
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal... (Spark Summit)
The Internet of Things (IoT) is a growing network supporting a wide variety of service types with specific network requirements that differ from traditional human-type communications. This has led to the emergence of dedicated IoT network standards. To optimize investments in dedicated network infrastructures, we're investigating a dynamic approach to network capacity planning that accommodates multiple IoT traffic types over a cellular network while maintaining their specific requirements.
We studied models of IoT traffic and used machine learning to predict and schedule future workloads under heterogeneous and variable traffic conditions where human-type and machine-type communications are mixed.
An integrated analytics framework including Hadoop and Spark was deployed for experimentation, and a number of capacity planning use cases were implemented to verify the accuracy of the method.
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ... (Spark Summit)
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL thanks to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle's innovations for Apache Spark SQL. This 16x improvement is made possible by Oracle's Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will also look at using Spark SQL for feature creation and feature generation within Spark ML pipelines.
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads on real applications in cloud deployments. In addition, ideas for the next generation of the Tungsten project will be discussed.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable Spark SQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO-bound) and data decoding (CPU-bound).
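Reading and writing Parquet from Spark is a one-liner in each direction. The following sketch uses made-up weather rows and a hypothetical /tmp path, not The Weather Company's data; partitioning by date is what lets queries skip whole files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

df = spark.createDataFrame(
    [("2017-01-01", "atlanta", 52.3), ("2017-01-01", "boston", 38.1),
     ("2017-01-02", "atlanta", 55.0)],
    ["date", "city", "temperature"],
)

# Columnar and compressed on disk; partitioning by date enables partition pruning.
df.write.mode("overwrite").partitionBy("date").parquet("/tmp/weather_parquet")

readback = spark.read.parquet("/tmp/weather_parquet")
readback.filter(readback.city == "atlanta").show()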
This tutorial introduces the basic idea of machine learning with a very simple example. Machine learning teaches machines (and me too) to carry out tasks and learn concepts by themselves. It is that simple, so here is an overview:
http://www.softwareschule.ch/examples/machinelearning.jpg
Lab 2: Classification and Regression Prediction Models, training and testing ... (Yao Yao)
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, apply the following data cleaning, data analysis, predictive analysis, and machine learning steps:
Lab 2: Classification and regression prediction models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance, feature ranking with recursive feature elimination, and two-dimensional Linear Discriminant Analysis.
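A compressed sketch of that workflow is shown below. It is not the lab notebook itself: a bundled scikit-learn dataset stands in for the Zillow properties data, and only a few of the listed steps are shown.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in dataset; the lab itself uses the Zillow properties data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Optimize K for a KD-tree KNN, as in the lab outline.
knn = GridSearchCV(KNeighborsClassifier(algorithm="kd_tree"),
                   {"n_neighbors": [3, 5, 7, 9]}, cv=5)
knn.fit(X_train, y_train)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

for name, model in [("KNN", knn), ("Random Forest", rf), ("Naive Bayes", nb)]:
    print(name, "test accuracy:", model.score(X_test, y_test))

# Feature importances from the random forest, as used in the lab.
print("largest feature importance:", rf.feature_importances_.max())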
Start machine learning in 5 simple steps (Renjith M P)
Simple steps to get started with machine learning.
The use case uses Python programming; the target audience is expected to have very basic Python knowledge.
In this article you will learn how to use the TensorFlow softmax classifier estimator to classify the MNIST dataset in one script.
The article also introduces the basic idea of an artificial neural network.
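For orientation, here is a roughly equivalent single-script sketch using tf.keras rather than the estimator API the article describes; the hyperparameters are illustrative choices, not the article's.

import tensorflow as tf

# Load MNIST and flatten the 28x28 images into 784-dimensional vectors.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# A single dense layer with softmax activation is multinomial logistic regression.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=2)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])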
Oracle analytical functions include FIRST_VALUE, LAST_VALUE, LEAD, LAG, and NTH_VALUE with unbounded window frames, as well as the difference between RANK and DENSE_RANK. The presentation also covers ROLLUP, CUBE, and GROUPING, different types of window functions, and analytical window frames.
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI (Data Science Milan)
The talk will introduce Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without the need to write code. It is unique in its ability to make deep learning easier to understand for non-experts and to enable faster model-improvement iteration cycles for experienced machine learning developers and researchers alike. By using Ludwig, experts and researchers can simplify the prototyping process and streamline data processing so that they can focus on developing deep learning architectures.
Bio:
Piero Molino is a Senior Research Scientist at Uber AI focusing on machine learning for language and dialogue. Piero completed a PhD on question answering at the University of Bari, Italy. He founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, and then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs.
This is an elaborate presentation on how to predict employee attrition using various machine learning models. It will take you through the process of statistical model building using Python.
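A minimal, hypothetical sketch of that flow is below; the column names and values are invented for illustration and are not from the presentation's dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical attrition table; real HR data would have many more columns.
df = pd.DataFrame({
    "age":              [25, 41, 30, 52, 28, 45, 33, 39],
    "monthly_income":   [3200, 8000, 4100, 9500, 3500, 7200, 5200, 6100],
    "years_at_company": [1, 10, 3, 20, 2, 12, 5, 8],
    "attrition":        [1, 0, 1, 0, 1, 0, 0, 0],  # 1 = left the company
})

X = df.drop(columns="attrition")
y = df["attrition"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))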
▪ Developed a recursive-descent parser to generate an intermediate representation for subsequent optimizations in Java
▪ Implemented common subexpression elimination and copy propagation on control flow graph
▪ Deployed a code generator for the source language that yields optimized native programs
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit www.vavaclasses.com
Francesca Gottschalk - How can education support child empowerment.pptx (EduSkills OECD)
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Macroeconomics - Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Model Attribute Check Company Auto Property (Celine George)
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
How to Make a Field Invisible in Odoo 17 (Celine George)
It is possible to hide or make invisible some fields in Odoo, commonly by using the “invisible” attribute in the field definition. This slide will show how to make a field invisible in Odoo 17.
Operation “Blue Star” is the only event in the history of independent India where the state went to war with its own people. Even after about 40 years it is not clear whether it was the culmination of the state's anger toward the people of the region, a political game of power, or the start of a dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from the mainstream due to the denial of their just demands during a long democratic struggle since independence. As happens all over the world, this led to a militant struggle with great loss of lives among military, police, and civilian personnel. The killing of Indira Gandhi and the massacre of innocent Sikhs in Delhi and other Indian cities were also associated with this movement.
Acetabularia Information For Class 9.docx (vaibhavrinwa19)
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
2024.06.01 Introducing a competency framework for language learning materials ... (Sandy Millin)
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Biological screening of herbal drugs: introduction and need for phyto-pharmacological screening; new strategies for evaluating natural products; in vitro evaluation techniques for antioxidant, antimicrobial, and anticancer drugs; in vivo evaluation techniques for anti-inflammatory, antiulcer, anticancer, wound healing, antidiabetic, hepatoprotective, cardioprotective, diuretic, and antifertility activity; toxicity studies as per OECD guidelines.
The Roman Empire: A Historical Colossus.pdf (kaushalkr1407)
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
3. Logistic Regression Model
• Logistic regression is a model used to predict the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical.
• It is used to predict a binary response, that is, the outcome of a categorical dependent variable (a class label), based on one or more predictor variables (features).
• Logistic regression is the standard industry workhorse that underlies many production fraud detection and advertising quality and targeting products.
4. Mahout
• Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.
• The advantage of Mahout over other approaches becomes striking as the number of training examples gets extremely large. What "large" means can vary enormously.
• As the input exceeds 1 to 10 million training examples, something scalable like Mahout is needed.
5. Logistic Regression Using Mahout
• Mahout's implementation of logistic regression uses the Stochastic Gradient Descent (SGD) algorithm.
• This algorithm is sequential (nonparallel), but it's fast.
• While working with large data, the SGD algorithm uses a constant amount of memory regardless of the size of the input.
• Mahout includes a command line example of a logistic regression program.
• For production use, the logistic regression step mostly is not run from the command line, but is integrated more tightly into some data flow.
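In symbols, the per-example update behind this description is the standard stochastic gradient descent step (a generic formulation; Mahout's SGD additionally anneals the learning rate and supports regularization):

w_{t+1} = w_t - \eta_t \, \nabla_w \, \ell(w_t; x_i, y_i)

Because only the current weight vector w is kept and updated in place for each training example, memory use stays constant regardless of the input size, as the slide notes.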
6. Logistic Regression Using Mahout
• Mahout's implementation of logistic regression using SGD supports the following command line program names:
Valid program names are:
  cat : Print a file or resource as the logistic regression models would see it
  runlogistic : Run a logistic regression model against CSV data
  trainlogistic : Train a logistic regression using stochastic gradient descent
7. BUILDING THE MODEL
You can build a model to determine the REQUIRED field from the selected features with the following command:
$ bin/mahout trainlogistic --input Input.csv \
    --output ./model \
    --target TARGETVARIABLE --categories 2 \
    --predictors Predictor1 Predictor2 . . . --types numeric \
    --features 20 --passes 100 --rate 50
......
This command specifies that the input comes from the resource named Input.csv, that the resulting model is stored in the file ./model, that the target variable is in the field named TARGETVARIABLE, and that it has two possible values. The command also specifies that the algorithm should use the listed variables (Predictor1, Predictor2, . . .) as predictors, all with numerical types. The remaining options specify internal parameters for the learning algorithm.
Logistic Regression Using Mahout
8. Logistic Regression Using Mahout
Handles for trainlogistic (handle : what it does):
--quiet : Produces less status and progress output.
--input <file-or-resource> : Uses the specified file or resource as input.
--output <file-for-model> : Puts the model into the specified file.
--target <variable> : Uses the specified variable as the target.
--categories <n> : Specifies how many categories the target variable has.
--predictors <v1> ... <vn> : Specifies the names of the predictor variables.
--types <t1> ... <tm> : Gives a list of the types of the predictor variables. Each type should be one of numeric, word, or text. Types can be abbreviated to their first letter. If too few types are given, the last one is used again as necessary. Use word for categorical variables.
--passes : Specifies the number of times the input data should be reexamined during training. Small input files may need to be examined dozens of times. Very large input files probably don't even need to be completely examined.
--lambda : Controls how much the algorithm tries to eliminate variables from the final model. A value of 0 indicates no effort is made. Typical values are on the order of 0.00001 or less.
--rate : Sets the initial learning rate. This can be large if you have lots of data or use lots of passes, because it's decreased progressively as data is examined.
--noBias : Eliminates the intercept term (a built-in constant predictor variable) from the model. Occasionally this is a good idea, but generally it isn't, since the SGD learning algorithm can usually eliminate the intercept term if warranted.
--features : Sets the size of the internal feature vector to use in building the model. A larger value here can be helpful, especially with text-like input data.
9. EVALUATING THE MODEL
You can evaluate the model by running it on the same training dataset:
$ bin/mahout runlogistic --input test.csv --model ./model --auc --confusion
......
This command takes as input the model created by our trainlogistic command and prints the area under the curve (AUC) and the confusion matrix.
Logistic Regression Using Mahout
Handles for runlogistic (handle : what it does):
--quiet : Produces less status and progress output.
--auc : Prints the AUC score for the model versus the input data after reading the data.
--scores : Prints the target variable value and scores for each input example.
--confusion : Prints the confusion matrix for a particular threshold (see --threshold).
--input <input> : Reads data records from the specified file or resource.
--model <model> : Reads the model from the specified file.
10. Demo - Dataset
• Let's take an example data set:
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,...,1
. . .
0.67132937326096,0.571220482233912,23,1,5,2,0.450683127402953,...,1
0.548616112209857,0.405350996181369,24,1,5,3,0.300979638576258,...,1
0.677980388281867,0.993355110753328,25,2,3,9,0.459657406894831,...,1
• Now we will run our logistic regression model to predict the value of color based on the predictors "x","y","shape","k","k0","xx","xy","yy","a","b","c","bias".
11. Demo – Dataset Explained
Variable | Description | Possible values
x | The x coordinate of a point | Numerical from 0 to 1
y | The y coordinate of a point | Numerical from 0 to 1
shape | The shape of a point | Shape code from 21 to 25
color | Whether the point is filled or not | 1=empty, 2=filled
k | The k-means cluster ID derived using only x and y | Integer cluster code from 1 to 10
k0 | The k-means cluster ID derived using x, y, and color | Integer cluster code from 1 to 10
xx | The square of the x coordinate | Numerical from 0 to 1
xy | The product of the x and y coordinates | Numerical from 0 to 1
yy | The square of the y coordinate | Numerical from 0 to 1
a | The distance from the point (0,0) | Numerical from 0 to
b | The distance from the point (1,0) | Numerical from 0 to
c | The distance from the point (0.5,0.5) | Numerical from 0 to
bias | A constant | 1
12. Demo – Training the Model
• To determine the color field from the x and y features, use the following command:
$ bin/mahout trainlogistic --input color.csv \
    --output ./model \
    --target color --categories 2 \
    --predictors x y --types numeric \
    --features 20 --passes 100 --rate 50
...
• After the model is trained, you can run the model on the training data again to evaluate how it does (even though we know it won't do well).
$ bin/mahout runlogistic --input color.csv --model ./model --auc --confusion
AUC = 0.57
confusion: [[27.0, 13.0], [0.0, 0.0]]
• AUC can range from 0.5, for a model that's no better than random, to 1, for a perfect model.
• The value here of 0.57 indicates a model that's hardly better than random.
13. Demo – Selecting Features
• You can get more interesting results if you use additional variables for training. For instance, this command allows the model to use x and y as well as a, b, and c to build the model:
$ bin/mahout trainlogistic --input color.csv --output model \
    --target color --categories 2 \
    --predictors x y a b c --types numeric \
    --features 20 --passes 100 --rate 50
• If you run this revised model on the training data, you'll get a very different result than with the previous model:
$ bin/mahout runlogistic --input color.csv --model model --auc --confusion
Result:
AUC = 1.00
confusion: [[27.0, 0.0], [0.0, 13.0]]
entropy: [[-0.1, -1.5], [-4.0, -0.2]]
• Now you have a perfect value of 1 for AUC, and you can see from the confusion matrix that the model was able to classify all of the training examples perfectly.
14. Demo – Evaluation
• Now you can run this same model on additional data in the color-test.csv resource. Because this new data was not used to train the model, there are a few surprises in store for the model.
$ bin/mahout runlogistic --input color-test.csv --model model --auc --confusion
Result:
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -2.8], [-4.1, -0.1]]
• On this withheld data, you can see that the AUC has dropped to 0.97, which is still quite good.