This document provides an introduction to the Julia programming language and demonstrates its use for statistical computing and analysis. It summarizes Julia's capabilities for technical computing and compares its performance to R, Python, and other languages. Examples showing Julia's speed advantages include bootstrap analysis of correlation coefficients and MCMC sampling for Bayesian logistic regression. Overall, Julia is shown to be significantly faster than R and other options for many statistical tasks.
This document discusses R and Julia for data analysis and advanced analytics. It provides an overview of R's history, how it works, performance improvements, and use in production. Julia is introduced as a new high-performance dynamic language with similarities to R but faster performance due to its just-in-time compiler and type information. Examples are given comparing the performance of Julia to other languages. The document recommends Julia for those already using C/Fortran and suggests it will be useful for R users once fully developed.
The R language is a project designed to create a free, open-source replacement for S-PLUS, the commercial version of the S language originally developed at AT&T Bell Labs and currently marketed by Insightful Corporation of Seattle, Washington. R is an open-source implementation of S and differs from S-PLUS largely in its command-line-only interface.
Topics Covered:
1. Introduction to R
2. Installing R
3. Why Learn R
4. The R Console
5. Basic Arithmetic and Objects
6. Program Example
7. Programming with Big Data in R
8. Big Data Strategies in R
9. Applications of R Programming
10. Companies Using R
11. What R is not so good at
12. Conclusion
BASIC was originally created in 1963 as a teaching language to simplify programming. It has influenced computer science education and raised the need for coding knowledge. R is a free statistical programming language used for data analysis, modeling, and visualization. It includes many statistical and machine learning methods. UNIX was developed in the late 1960s and became widely used, while Linux is an open-source OS inspired by UNIX. Both operate using commands in a terminal rather than a graphical user interface.
R originated in the 1970s at Bell Labs and has since evolved significantly. It is an open-source programming language used widely for statistical analysis and graphics. While powerful, R has some drawbacks like poor performance for large datasets and a steep learning curve. However, its key advantages including being free, having a large community of users, and extensive libraries have made it a popular tool, especially for academic research.
A presentation on the history, design, and use of R. The talk will focus on companies that use and support R, use cases, where it is going, competitors, advantages and disadvantages, and resources to learn more about R.
Speaker Bio:
Joseph Kambourakis has been the Lead Data Science Instructor at EMC for over two years. He has taught in eight countries and been interviewed by Japanese and Saudi Arabian media about his expertise in Data Science. He holds a Bachelors in Electrical and Computer Engineering from Worcester Polytechnic Institute and an MBA from Bentley University with a concentration in Business Analytics.
This document provides an introduction to R, including what R is, how to install and use it, common mistakes, and data structures. It notes that R was created by Ross Ihaka and Robert Gentleman and now has over 10,000 user-contributed packages covering topics like statistics, graphics, and data analysis. It also gives instructions for installing R from its homepage or an Italian mirror site, using the R console and RStudio interfaces, working in the workspace environment, and saving workspaces to preserve data between sessions.
This short text will quickly get you up to speed on creating visualizations with R's ggplot2 package. It was developed as part of a training for people with no prior experience in R and limited knowledge of general programming concepts. It is a useful first guide for those exploring the field of data science.
This document provides an introduction to using R Studio for statistical analysis. It discusses how to install both R and R Studio on Windows and Mac systems. It then covers creating scripts and files in R Studio, basic R syntax including assigning values to variables, vectors, and strings. The document also demonstrates how to install and load packages to access additional functions, and how to access built-in datasets to practice working with data in R.
This is a presentation on R programming, with slides that pair striking images with useful information and use transitions and animations to keep the material engaging.
Created By - Abhishek Pratap Singh (Aps)
R is a programming language and software environment for statistical analysis and graphics. It originated from S, a statistical programming language developed in the 1970s. R was first released in 1993 and has since grown in popularity due to its ability to run on Linux, Windows and Mac operating systems. It allows users to contribute additional packages to extend its functionality. Getting help in R can be obtained through manuals, online searches, and mailing lists. R has a command line interface but various graphical user interfaces and integrated development environments are also available. Everything in R is an object that has a class and methods, with common functions to define classes, create objects, and extract object elements.
R is a programming language developed as an alternative for S at AT&T Bell Laboratories. It excels at statistical computation and graphic visualization. R is free, open source, and available across platforms. It has over 3,000 packages on CRAN that extend its functionality. R has a steep learning curve and working with large datasets is limited by RAM size. Major companies use R in business.
This document provides an introduction to using R for data science and analytics. It discusses what R is, how to install R and RStudio, statistical software options, and how R can be used with other tools like Tableau, Qlik, and SAS. Examples are given of how R is used in government, telecom, insurance, finance, pharma, and by companies like ANZ bank, Bank of America, Facebook, and the Consumer Financial Protection Bureau. Key statistical concepts are also refreshed.
R is a programming language for statistical analysis and graphics. It is an open-source language developed by statisticians to allow for easy statistical analysis and visualization of data. The document provides an overview of R, discussing its origins, functionality, uses in data science, and popular packages and IDEs used with R. Examples are given of basic R syntax for vectors, matrices, data frames, plotting, and applying functions to data.
This presentation is an introduction to the R programming language. We will talk about the usage, history, data structures, and features of R.
This lecture series covers the use of the R language, its interface, and the functions required to evaluate financial risk models. It applies R to financial market data, risk measurement, modern portfolio theory, risk modeling of returns with generalized hyperbolic and lambda distributions, Value at Risk (VaR) modeling, extreme value methods and models, the class of ARCH and GARCH risk models, and portfolio optimization approaches.
This document contains a series of exercises to assess conceptual learning of the Java programming language. It includes exercises on primitive data types like short, int, double, and char. Exercises explore assigning values, arithmetic operators, trigonometry functions, converting between degrees and radians, creating and using String objects, and computing string lengths. The exercises are meant to test understanding of basic Java concepts and to find errors in programs by compiling code with invalid values or missing elements.
This document presents an Integrative Model for Parallelism (IMP) that aims to provide a unified treatment of different types of parallelism. It describes the key concepts of the IMP including the programming model using sequential semantics, the execution model using a data flow virtual machine, and the data model using distributions to describe data placement. It demonstrates the IMP concepts using a motivating example of 3-point averaging and discusses tasks, processes, and research opportunities around the IMP approach.
C & C++ Training Centre in Ambala! BATRA COMPUTER CENTRE (jatin batra)
Are you in search of C & C++ training in Ambala? Your search ends here: BATRA COMPUTER CENTRE provides training in computer basics, HTML, PHP, web designing,
web development, SEO, SMO, and many other courses.
A basic tutorial for R programming. This video covers:
- Agenda
- History
- Software paradigm
- The R interface
- Advantages of R
- Drawbacks of R
The document discusses the history and evolution of programming languages from the 1940s to present. It notes that early languages provided little abstraction from computer hardware, but that over time languages increasingly abstracted complexity and improved developer productivity. The document outlines the development of assembly languages, third generation languages like FORTRAN, and more modern paradigms like object-oriented programming. It also discusses influential ideas like structured programming and the "GOTO controversy" that aimed to improve programming practices.
This document provides an introduction to the SciPy Python library and its uses for scientific computing and data analysis. It discusses how SciPy builds on NumPy to provide functions for domains like linear algebra, integration, interpolation, optimization, statistics, and more. Examples are given of using SciPy for tasks like LU decomposition of matrices, sparse linear algebra, single and double integrals, line plots, and statistics. SciPy allows leveraging Python's simplicity for technical applications involving numerical analysis and data manipulation.
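To make the SciPy summary above concrete, here is a small, self-contained sketch of two of the tasks it mentions: LU decomposition and single/double integrals. The matrix and integrands below are illustrative choices, not taken from the original slides.

```python
import numpy as np
from scipy import integrate, linalg

# LU decomposition: factor A into permutation P, lower L, upper U so A = P @ L @ U
A = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, U = linalg.lu(A)

# Single definite integral: integral of x^2 over [0, 1] equals 1/3
val, err = integrate.quad(lambda x: x**2, 0.0, 1.0)

# Double integral of x*y over the unit square equals 1/4
# (dblquad integrates the inner variable y first; the callable takes (y, x))
val2, err2 = integrate.dblquad(lambda y, x: x * y, 0.0, 1.0, 0.0, 1.0)
```

This leans on SciPy's convention of building on NumPy arrays, which is the point the summary makes about leveraging Python's simplicity for numerical work.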
R is a widely used statistical programming language and software environment for statistical analysis and graphics. It includes over 6,700 packages and was originally based on S, which was developed in the 1970s. RStudio is a popular integrated development environment for R that provides a simpler interface. R supports object-oriented programming and there are many ways to perform the same tasks in R, such as calculating statistics, building models, and creating visualizations of data.
This document discusses effective data visualization using Microsoft R Open. It introduces why visualizing data is important, the benefits of using R and Microsoft R Open for data visualization, and how to get started. The document then provides examples of different types of graphs that can be created in R using the base graphics package as well as several other packages. It discusses choosing an appropriate graphics package and considerations for sizing and saving graphs. The examples focus on direct comparisons, distributions, trends over time, relationships, percentages, and special cases. Code and data are provided in appendices and online for reproducing the graphs.
The presentation is a brief case study of R Programming Language. In this, we discussed the scope of R, Uses of R, Advantages and Disadvantages of the R programming Language.
R is a programming language and free software environment for statistical analysis and graphics. It is widely used among statisticians and data scientists for developing statistical software and data analysis. Some key facts about R:
- It was created in the 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland.
- R can be used for statistical computing, machine learning, graphical display, and other tasks related to data analysis.
- It runs on Windows, Linux, and MacOS operating systems. Code written in R is cross-platform.
- R has a large collection of statistical and graphical techniques built-in, and users can extend its capabilities by downloading additional packages.
- Major
R is a programming language and software environment for statistical analysis and graphics. It was created by Ross Ihaka and Robert Gentleman in the early 1990s at the University of Auckland, New Zealand. Some key points:
- R can be used for statistical computing, machine learning, and data analysis. It is widely used among statisticians and data scientists.
- It runs on Windows, Mac OS, and Linux. The source code is published under the GNU GPL license.
- Popular companies like Facebook, Google, Microsoft, Uber and Airbnb use R for data analysis, machine learning, and statistical computing.
- R has a variety of data structures like vectors, matrices, arrays, lists
RStudio is an integrated development environment for R that allows users to i... (SWAROOP KUMAR K)
R is a widely used statistical programming language and software environment for statistical analysis and graphics. It includes over 6,700 packages and was originally based on S, which was developed at Bell Labs in the 1970s. RStudio is a popular integrated development environment for R that provides a simpler interface compared to using R alone. R is object-oriented and there are often multiple ways to perform the same task.
R is an open source statistical programming language developed from S at the University of Auckland in 1993. It is dynamically typed and treats vectors as first-class objects. Functions in R are also objects that can be assigned to variables. R has various options for binding scalars and vectors together into arrays and data frames for aggregate analysis. It also includes many built-in functions for numerical, statistical, and character manipulation of data.
This document discusses Julia, a programming language for technical computing that is similar to R but aims to provide both fast development and fast execution. It provides an overview of Julia's features, compares it to R, and gives an example of implementing a simple Gibbs sampler in both R and Julia. Julia code runs much faster than equivalent R code due to its just-in-time compilation, and it can also distribute computations across multiple processors. The document demonstrates Julia's syntax and shows how to define functions, templated methods, and new data types.
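The slides' Gibbs sampler code is not reproduced here, but R-vs-Julia comparisons of this kind typically used a small bivariate Gibbs sampler. As a hedged sketch, here is the same style of algorithm in Python; the target density and its full conditionals below are a standard textbook choice, not necessarily the one in the slides.

```python
import random

def gibbs(n_iter=1000, thin=10, seed=42):
    """Gibbs sampler for f(x, y) proportional to x^2 exp(-x y^2 - y^2 + 2y - 4x),
    alternating draws from the two full conditionals:
      x | y ~ Gamma(shape=3, rate=y^2 + 4)
      y | x ~ Normal(mean=1 / (1 + x), var=1 / (2 (1 + x)))
    """
    rng = random.Random(seed)
    x, y = 1.0, 1.0
    samples = []
    for _ in range(n_iter):
        for _ in range(thin):  # thinning: keep only every `thin`-th draw
            x = rng.gammavariate(3.0, 1.0 / (y * y + 4.0))  # scale = 1 / rate
            y = rng.gauss(1.0 / (1.0 + x), (0.5 / (1.0 + x)) ** 0.5)
        samples.append((x, y))
    return samples

samples = gibbs()
```

In Julia the equivalent loop is compiled to native code by the JIT, which is where the large speedups over interpreted R loops reported in the document come from.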
Without analytics on big data, companies are unable to understand their environment and customers, similar to how deer cannot see or hear approaching vehicles on a highway. Presentations are tools that can be used for lectures, reports, and more. They serve various purposes, making presentations powerful tools for convincing and teaching others. Data science uses techniques from multiple fields like mathematics, statistics, and computer science to analyze large amounts of data and extract meaningful insights for business.
This document provides an overview of the R programming language and environment. It discusses why R is useful, outlines its interface and workspace, describes how to access help and tutorials, install packages, and input/output data. The interactive nature of R is highlighted, where results from one function can be used as input for another.
Python is a language of choice for data analysis.
The aim of this slide deck is to provide a comprehensive learning path for people new to Python for data analysis, covering the steps you need to learn to use Python effectively.
R is a free and open-source programming language and software environment for statistical analysis, graphics, and statistical computing. It grew out of the S language, developed at Bell Laboratories by statistician John Chambers and colleagues. Key points about R include that it is an interpreted language, supports functional programming, and is object-oriented. R can be used for tasks like statistical analysis, data visualization, and machine learning. It has a large community of users and developers contributing packages for specialized analysis techniques.
The document discusses key concepts related to data structures and algorithms in C including:
1. Data structures allow for efficient storage and retrieval of data through logical organization and mathematical modeling.
2. Algorithms must be correct, finite, and efficient to solve problems by taking input and producing output through a defined sequence of steps.
3. Common data structures covered include arrays, stacks, queues, linked lists, trees, and graphs. Abstract data types allow separation of implementation from interface.
This document discusses R programming and compares it to Python. R is an open-source programming language commonly used for statistical analysis and visualization. It has many libraries that enable data analysis and machine learning. The document compares key aspects of R and Python, such as their creators, release years, software environments, usability, and pros and cons. It concludes that R is easy to learn and offers powerful graphics and statistical techniques through libraries, making it well-suited for data analysis applications.
The document discusses Python interview questions and answers related to Python fundamentals like data types, variables, functions, objects and classes. Some key points include:
- Python is an interpreted, interactive, and object-oriented programming language. It uses indentation to delimit code blocks rather than braces.
- Python supports dynamic typing where the type is determined at runtime. It is strongly typed meaning operations inappropriate for a type will fail with an exception.
- Common data types include lists (mutable), tuples (immutable), dictionaries, strings and numbers.
- Functions are defined with def; arguments are passed by object reference, and variables can have local or global scope.
- Classes use inheritance, polymorphism and encapsulation to create
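A few of the interview points above can be demonstrated in a short, self-contained snippet (the variable and function names are illustrative only):

```python
# Dynamic but strong typing: types are checked at runtime
x = 3
x = "three"            # rebinding a name to a different type is fine
try:
    "three" + 3        # mixing incompatible types raises instead of coercing
except TypeError:
    mixed = "TypeError"

# Mutable list vs immutable tuple
nums = [1, 2, 3]
nums.append(4)         # lists can be modified in place
point = (1, 2)
try:
    point[0] = 9       # tuples reject item assignment
except TypeError:
    immut = "TypeError"

# Dictionaries map keys to values
ages = {"ada": 36}
ages["alan"] = 41

# Functions use def; parameters may have defaults
def greet(name, punct="!"):
    return "Hello, " + name + punct
```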
This document discusses visualizing data in R using various packages and techniques. It introduces ggplot2, a popular package for data visualization that implements Wilkinson's Grammar of Graphics. Ggplot2 can serve as a replacement for base graphics in R and contains defaults for displaying common scales online and in print. The document then covers basic visualizations like histograms, bar charts, box plots, and scatter plots that can be created in R, as well as more advanced visualizations. It also provides examples of code for creating simple time series charts, bar charts, and histograms in R.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
1. Have you met Julia?
Tommaso Rigon
May 2, 2016
Tommaso Rigon Have you met Julia? May 2, 2016 1 / 25
2. Introduction
Which software are we more likely to use?
A non-comprehensive list
In statistics many programming languages can be used. One could use:
1 C / Fortran.
1 Low-level programming languages.
2 General purpose languages, but very efficient for numeric computing.
2 Python
1 Open source and general purpose language.
2 Widespread in industry and among computer scientists.
3 Matlab
1 Closed source (!)
2 Optimized for numerical computing, fast and clear linear algebra.
4 R
1 Open source: a lot of additional statistical packages are available.
2 R is developed by statisticians for statisticians.
3 Widespread among academics.
3. Introduction
A typical workflow in R
Suppose we are going to analyze a real dataset:
1 Data management. We need to read the data into R, from a text file or
from a database. We also need to arrange it in a convenient form (the
dplyr package is awesome!).
2 Data visualization. Visualize the data (See package ggplot2).
3 Statistical Modeling. First analyses are done using available packages.
4 Developing. We need to implement our new methodology.
5 Reporting. We need to communicate our results effectively, usually with
tables and graphs. (See the Markdown and knitr projects).
But...
1 The script is quickly developed in R, but it is often (very) slow. Sometimes
this precludes the use of the whole dataset.
2 The slow parts need to be written in C or Fortran and then interfaced to R
(the Rcpp package helps!)
4. Introduction
R is great but...
A vectorized language
1 The language encourages operating on whole objects (i.e. vectorized
code). However, some tasks (e.g. MCMC) cannot be easily vectorized.
2 Unvectorized R code (for and while loops) is slow.
A nested for loop compared to the same vectorized operation
system.time(for (i in 1:10^4) for (j in 1:10^3) runif(1))
# user system elapsed
# 17.689 0.000 17.528
system.time(runif(10^7))
# user system elapsed
# 0.410 0.000 0.424
5. Introduction
What is Julia?
Julia according to its developer
Julia is a high-level, high-performance dynamic programming language for
technical computing, with syntax that is familiar to users of other technical
computing environments.
Julia in a nutshell
Julia was released recently, in 2012, by Jeff Bezanson, Stefan Karpinski, Viral
Shah and Alan Edelman. The latest stable version is 0.4.5.
1 Open-source, with MIT liberal license.
2 High-level and familiar. It can work on the level of vectors, matrices, arrays.
The syntax is similar to Matlab and R and easy to read without a huge effort.
3 Technical computing. It is specifically optimized for scientific computing,
not necessarily statistics.
6. Introduction
Why Julia?
Julia in a nutshell - Technical details
1 Julia has a REPL (read–eval–print loop). Exactly as in R, it is possible to
interact with the software, facilitating debugging, testing and developing.
Conversely, languages like C usually follow an edit–compile–run cycle.
2 Based on a sophisticated JIT (just-in-time) compiler built on LLVM.
3 Julia is fast. Its compiler is designed to approach the speed of C.
4 No need to vectorize code for performance; devectorized code is fast.
5 Efficient support for Unicode, including but not limited to UTF-8. It means,
for instance, that
µ = 10; σ = 5
are legitimate assignments in Julia.
6 Designed for parallelism and distributed computation
7. Introduction
Packages available
Julia for statistics
These are some useful packages for statistical computing
1 Distributions. Probability distributions and associated functions (similar but
not equal to the d-p-q-r system in R).
2 DataFrames. For handling datasets, possibly containing missing values.
3 GLM. Generalized linear models, including linear model.
4 StatsBase. Basic descriptive functions: sample mean, median, sample
variance...
5 ...and many others!
Julia and R integration
1 rjulia. An R package that calls Julia functions and imports / exports objects
between the two environments. Currently under development, available only
on GitHub.
2 RCall. As the name itself suggests, it calls R from Julia.
8. Introduction
Speeding up computations
Do we really need such fast and powerful tools?
In many cases, we do not. Suppose our implementation is badly written and
inefficient, but it takes about 1 second to execute. Is it worth improving
the code?
Where is efficient computation really necessary?
Just to mention some areas among others:
1 In almost any procedure applied to huge datasets (even linear models!)
2 In any procedure which involves cross-validation (both Lasso and CART
often use CV for model selection).
3 In “bootstrap-like” procedures (bootstrap, bagged trees, ...).
4 In Bayesian statistics, in approximating the posterior distribution through
simulations (e.g. MCMC, Importance sampling, ABC,...).
5 A combination of the previous.
9. Bootstrap example
Bootstrap example
What is the bootstrap? A (very) brief explanation
1 It is an inferential technique which (usually!) makes use of simulation. For
instance, in a frequentist framework, it can be used to construct confidence
intervals.
2 Let θ̂(Y) be an estimator of θ, with Yi ∼ F i.i.d. random vectors for
i = 1, . . . , n. The “true” c.d.f. F is replaced with an estimate F̂. Then, we
simulate Y*r from F̂ for r = 1, . . . , R and we get

θ̂*1, . . . , θ̂*R, where θ̂*r = θ̂(Y*r),

which is a bootstrap sample of the estimator.
3 The bootstrap sample can be used to make inference on θ. This is usually
the main goal, but here we are mainly interested in simulating it quickly.
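The resampling scheme above can be sketched in a few lines. The following Python snippet is an illustration, not part of the original slides: the estimator (the sample mean) and the data vector are arbitrary placeholder choices.

```python
import numpy as np

def bootstrap(estimator, y, R, rng=None):
    """Return R bootstrap replicates of estimator(y).

    Resamples the n observations with replacement, i.e. the
    nonparametric bootstrap, where F-hat is the empirical c.d.f.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    # For each replicate, draw n indices with replacement
    # and re-apply the estimator to the resampled data.
    return np.array([estimator(y[rng.integers(0, n, size=n)])
                     for _ in range(R)])

# Example: bootstrap distribution of the sample mean
y = np.array([2.1, 3.4, 1.9, 4.2, 2.8, 3.1])
boot = bootstrap(np.mean, y, R=1000, rng=0)
```

Each entry of `boot` is the estimator evaluated on one resampled dataset; the spread of these values approximates the sampling variability of the estimator.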
10. Bootstrap example
Inference on the correlation coefficient
An example, using the “cars” dataset
I have considered the dataset cars available in R. Suppose that Yi = (Y1i , Y2i )
are i.i.d. I would like to make inference on the correlation coefficient

ρ̂ = Ĉov(Y1, Y2) / √( V̂ar(Y1) V̂ar(Y2) ),

using the so-called nonparametric bootstrap, that is, F̂ is replaced by the empirical
distribution function. This operation can be vectorized and has been done both
in Julia and in R, for comparison.
In practice...
We need to “resample” the original data, with replacement. Then, we evaluate
the correlation coefficient for each bootstrap sample.
11. Bootstrap example
Implementation
Listing 1: Bootstrap; R implementation
rho_boot <- function(R,dataset){
n <- NROW(dataset)
# Sampling the indexes
index <- matrix(sample(1:n,R*n,replace=TRUE),R,n)
# Bootstrap correlation estimate
apply(index,1,function(x) cor(dataset[x,1],dataset[x,2]) )
}
Listing 2: Bootstrap; Julia implementation
function rho_boot(R,dataset)
n = size(dataset)[1]
# Sampling the indexes
index = rand(1:n,n,R)
out = Array(Float64,R)
for i in 1:R
# Bootstrap correlation estimate
out[i] = cor(dataset[index[:,i],:])[1,2]
end
out
end
12. Bootstrap example
Performance - Milliseconds in log-scale
[Figure: boxplot of execution times (milliseconds, log scale) for Naive_R, Julia, Boot_library and Boot_library_cor2]
13. Bootstrap example
Global Performance
And the winner is...
1 The R code is vectorized and therefore we expect a good performance.
2 Despite this, for this particular problem Julia is ≈ 10 times faster than R. This
is true even if we use the boot package.
Speeding up the R code
The bottleneck of the computation is the R cor function. It is designed for the
evaluation of an entire correlation matrix. It also checks for missing values before
performing the calculation. Therefore, we can easily improve the code by defining
the function cor2. Now, Julia is ≈ 5 times faster than R.
cor2 <- function(x,y) {
xbar <- x-mean(x)
ybar <- y-mean(y)
sum(xbar*ybar)/sqrt(sum(xbar^2)*sum(ybar^2))
}
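The same trick carries over to other languages. As a sketch (Python here rather than R, and not from the original slides), computing Pearson's correlation directly from centered sums skips the matrix machinery and missing-value checks of a general-purpose routine, while agreeing with the library result:

```python
import numpy as np

def cor2(x, y):
    """Pearson correlation of two vectors, computed directly from
    centered sums, with no missing-value checks or matrix overhead."""
    xbar = x - x.mean()
    ybar = y - y.mean()
    return np.sum(xbar * ybar) / np.sqrt(np.sum(xbar**2) * np.sum(ybar**2))

# Small placeholder vectors to check against the library routine
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
r = cor2(x, y)
```

Inside a tight bootstrap loop, shaving this per-call overhead is exactly what closed part of the gap with Julia.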
14. Bootstrap example
Bootstrap final result
[Figure: estimated bootstrap density of the correlation coefficient, roughly supported on (0.6, 0.9)]
15. Bootstrap example
Principal component analysis
Notation about PCA
Let yi = (yi1, . . . , yip), for i = 1, . . . , n, be i.i.d. realizations from a random vector
having covariance matrix Σ. Let Σ̂ be the sample covariance matrix and R the
related correlation matrix. The spectral decomposition of R is denoted as follows:

R = GΛGᵀ,  Λ = Diag(λ1, . . . , λp),  λ1 > λ2 > · · · > λp
The quantity of interest
The quantity of interest is the cumulative percentage of the “total variance”
explained by the first k principal components:

τ̂k = (λ1 + · · · + λk) / (λ1 + · · · + λp) = (λ1 + · · · + λk) / p,

where the second equality holds because the eigenvalues of a correlation matrix
sum to p.
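As a quick numerical check (a Python sketch, not from the slides; the data matrix is an arbitrary placeholder), the eigenvalues of a correlation matrix sum to p, so τ̂k can be computed with either denominator:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))         # n = 100 observations, p = 4 variables
R = np.corrcoef(X, rowvar=False)      # sample correlation matrix
lam = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, decreasing order

p = len(lam)
k = 2
# Cumulative share of "total variance" carried by the first k components
tau_k = lam[:k].sum() / lam.sum()
```

Since trace(R) = p (unit diagonal), `lam.sum()` equals `p` up to rounding, which is why dividing by p is equivalent.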
16. Bootstrap example
The iris dataset
A famous example
The iris dataset was considered just for illustrative purposes. We would like to
assess the variability of the quantity

τ̂1 = λ1 / p,

using a nonparametric bootstrap approach. The quantity τ̂1 is the relative
importance of the first principal component. Without the bootstrap, it would be
difficult to assess the variability of this estimate. Also, notice that we are not
assuming a specific parametric family of distributions for Y.
17. Bootstrap example
Implementation
function tau_est(data)
R = cor(data)
lambda = eigvals(R)
tau = (lambda/sum(lambda))[end] # Also lambda[end]/p is fine
tau
end
function pca_boot(R,data)
n = size(data)[1]
index = rand(1:n,n,R)
out = Array(Float64,R)
for i in 1:R
out[i] = tau_est(data[index[:,i],:])
end
out
end
18. Bootstrap example
Bootstrap final result
[Figure: estimated bootstrap density of the explained variance τ̂1, roughly supported on (0.70, 0.78)]
19. Bayesian statistics with Julia
A Bayesian logistic regression
The “shuttle” dataset
I have considered the famous “shuttle” dataset, having sample size n = 23. We
assume the following Bayesian logistic regression:

Yi ∼ Bin(6, θi),  θi = 1 / (1 + e^(−ηi)),  ηi = β0 + β1 xi ,

where the xi are known constants and i = 1, . . . , n. Moreover, let βj ∼ N(0, σµ²),
j = 0, 1, be the prior distributions, with σµ² a hyperparameter.
MCMC posterior computation
I have approximated the posterior distribution of β | Y using a Metropolis
algorithm. I have used a multivariate Gaussian random walk as proposal
distribution, with covariance matrix equal to the inverse of the observed
information.
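The sampler just described follows the standard random-walk Metropolis recipe. As a language-neutral sketch (Python here, not the original Julia listing; the univariate standard-normal target and the step size are placeholder choices), the accept/reject logic looks like:

```python
import numpy as np

def metropolis(logpost, start, R, step=1.0, rng=None):
    """Random-walk Metropolis: Gaussian proposal centered at the
    current state; accept with probability exp(logpost* - logpost)."""
    rng = np.random.default_rng(rng)
    x = start
    lp = logpost(x)
    out = np.empty(R)
    for r in range(R):
        x_star = x + step * rng.normal()           # random-walk proposal
        lp_star = logpost(x_star)
        if rng.uniform() < np.exp(lp_star - lp):   # Metropolis acceptance
            x, lp = x_star, lp_star                # accept; otherwise keep x
        out[r] = x                                 # store current state
    return out

# Placeholder target: standard normal log-density, up to a constant
chain = metropolis(lambda x: -0.5 * x**2, start=0.0, R=20000, rng=1)
```

The Julia version on the next slide is the same loop with a multivariate proposal and the logistic log-posterior plugged in as the target.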
20. Bayesian statistics with Julia
First step: the log-posterior
Listing 3: Julia implementation
using Distributions
# Log-likelihood
function loglik(data::Matrix, beta::Vector)
eta = beta[1] + beta[2]*data[:,3]
theta = 1./(1 + exp(- eta))
sum(data[:,2].*eta) + sum(data[:,1].*log(1-theta))
end
# Log-posterior up to an additive constant
function lpost(data::Matrix, beta::Vector, sigma_mu::Float64)
norm = Normal(0,sigma_mu)
loglik(data,beta) + logpdf(norm,beta[1]) + logpdf(norm,beta[2])
end
21. Bayesian statistics with Julia
Metropolis Algorithm
Listing 4: Julia implementation
using Optim # For numerical optimization
using ForwardDiff # For numerical derivative
# Maximum likelihood estimate
beta_hat = optimize(x -> -loglik(data,x),[0.0, 0.0], method=:l_bfgs).minimum
# Observed information matrix
Sigma = inv(ForwardDiff.hessian(x -> -loglik(data,x), beta_hat))
Listing 5: Julia implementation
function Metropolis(R::Int64, Sigma::Matrix, sigma_mu::Float64,start::Vector)
out = zeros(R,2)
beta = start #Initialization
for r in 1:R
beta_star = rand(MvNormal(beta,Sigma)) # Proposal distribution
alpha = exp(lpost(data,beta_star,sigma_mu) - lpost(data,beta,sigma_mu))
if rand(1)[1] < alpha # ‘rand’ is a pseudo random from a Uniform
beta = copy(beta_star) # Copy if accepted
end
out[r,:] = beta
end
out
end
22. Bayesian statistics with Julia
Performance - Milliseconds in log-scale
[Figure: boxplot of execution times (milliseconds, log scale) for R, Julia, OpenBUGS and STAN]
23. Bayesian statistics with Julia
Global performance
Julia now really shines!
1 For this particular problem Julia is ≈ 20 times faster than R. In fact, the for
loop is used extensively and there is no way to vectorize this operation.
2 Also, Julia is ≈ 13 times faster than OpenBUGS. However, OpenBUGS does
not necessarily use our Gaussian random walk, but tries to select the “best”
way to do MCMC according to its own criteria. Therefore, a fair comparison
should take into account, at the very least, the autocorrelation of the
sampled chain.
3 Finally, Julia is ≈ 10 times faster than STAN but, as for OpenBUGS, we should
be careful when making a direct comparison.
24. Bayesian statistics with Julia
Bayesian logistic final result
[Figure: approximate joint posterior of (β0, β1); β0 roughly in (−5, 15), β1 roughly in (−0.3, 0.0)]
25. References
Some references about Julia
1 http://julialang.org/
2 http://docs.julialang.org/en/release-0.4/
3 http://distributionsjl.readthedocs.org/en/latest/