This document discusses techniques for efficiently finding the intersection of two large datasets. It begins by noting that while computers can execute billions of operations per second, datasets are huge, ranging from terabytes to petabytes in size. Two techniques for finding intersections are examined first: scanning and sorting. Scanning runs in O(n²) time while the sorting-based approach runs in O(n log n), making it much faster. Hashing is proposed as an even faster alternative with an O(n) runtime. The document also discusses how to reason about algorithmic runtimes and how they scale with input size.
Computational Social Science, Lecture 07: Counting Fast, Part I
1. Counting Fast
(Part I)
Sergei Vassilvitskii
Columbia University
Computational Social Science
March 8, 2013
2. Computers are fast!
Servers:
– 3.5+ GHz
Laptops:
– 2.0-3.0 GHz
Phones:
– 1.0-1.5 GHz
Overall: Executes billions of operations per second!
3. But Data is Big!
Datasets are huge:
– Social Graphs (Billions of nodes, each with hundreds of edges)
• Terabytes (million million bytes)
– Pictures, Videos, associated metadata:
• Petabytes (million billion bytes!)
4. Computers are getting faster
Moore’s law (1965!):
– Number of transistors on a chip doubles every two years.
5. Computers are getting faster
Moore’s law (1965!):
– Number of transistors on a chip doubles every two years.
For a few decades:
– The speed of chips doubled every 24 months.
Now:
– The number of cores is doubling
– Clock speed is staying roughly the same
6. But Data is Getting Even Bigger
Unknown author, 1981 (?):
– “640K ought to be enough for anyone”
Eric Schmidt, March 2013:
– “There were 5 exabytes of information created between the dawn of
civilization through 2003, but that much information is now created
every 2 days, and the pace is increasing.”
7. Data Sizes
What is Big Data? (roughly tracking hard drive capacity over time)
– MB in the 1980s
– GB in the 1990s
– TB in the 2000s
– PB in the 2010s
8. Working with Big Data
Two datasets of numbers:
– Want to find the intersection (common values)
– Why?
• Data cleaning (these are missing values)
• Data mining (these are unique in some way)
9. Working with Big Data
Two datasets of numbers:
– Want to find the intersection (common values)
– Why?
• Data cleaning (these are missing values)
• Data mining (these are unique in some way)
– How long should it take?
• Each dataset has 10 numbers?
• Each dataset has 10k numbers?
• Each dataset has 10M numbers?
• Each dataset has 10B numbers?
• Each dataset has 10T numbers?
10. How to Find Intersections?
11. Idea 1: Scan
Look at every number in list 1:
– Scan through dataset 2, see if you find a match
common_elements = 0
for number1 in dataset1:
    for number2 in dataset2:
        if number1 == number2:
            common_elements += 1
12. Idea 1: Scanning
For each element in dataset 1, scan through dataset 2, see if it’s present
common_elements = 0
for number1 in dataset1:
    for number2 in dataset2:
        if number1 == number2:
            common_elements += 1
Analysis: Number of times the if statement is executed?
– |dataset2| for every iteration of outer loop
– |dataset1| * |dataset2| in total
13. Idea 1: Scanning
Analysis: Number of times the if statement is executed?
– |dataset2| for every iteration of outer loop
– |dataset1| * |dataset2| in total
Running time:
– 100M * 100M = 10¹⁶ comparisons in total
– At 1B (10⁹) comparisons / second
14. Idea 1: Scanning
Analysis: Number of times the if statement is executed?
– |dataset2| for every iteration of outer loop
– |dataset1| * |dataset2| in total
Running time:
– 100M * 100M = 10¹⁶ comparisons in total
– At 1B (10⁹) comparisons / second
– 10⁷ seconds ~ 4 months!
– Even with 1000 computers: 10⁴ seconds -- ~2.8 hours!
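As a back-of-the-envelope check, the same arithmetic in Python (a small sketch; it just restates the 1B-comparisons-per-second assumption above):

comparisons = 10**8 * 10**8             # 100M x 100M pairs = 10^16
per_second = 10**9                      # assumed comparison rate
seconds = comparisons / per_second      # 10^7 seconds
print(seconds / 86400)                  # ~116 days, i.e. roughly 4 months
print(seconds / 1000 / 3600)            # split across 1000 machines: ~2.8 hours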
15. Idea 2: Sorting
Suppose both sets are sorted
– Keep pointers to each
– Check for a match; otherwise advance the pointer at the smaller value
[Blackboard]
16. Idea 2: Sorting
sorted1 = sorted(dataset1)
sorted2 = sorted(dataset2)
pointer1, pointer2 = 0, 0
common_elements = 0
while pointer1 < len(sorted1) and pointer2 < len(sorted2):
    if sorted1[pointer1] == sorted2[pointer2]:
        common_elements += 1
        pointer1 += 1; pointer2 += 1
    elif sorted1[pointer1] < sorted2[pointer2]:
        pointer1 += 1
    else:
        pointer2 += 1
Analysis:
– Number of times the if statement is executed?
– Each iteration advances at least one pointer, so at most |dataset1| + |dataset2| iterations
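The same two-pointer idea, wrapped as a self-contained function for testing (a sketch; the function name and test values are illustrative, not from the slides):

def count_common_sorted(a, b):
    # Sort copies of both lists, then walk them in lockstep.
    a, b = sorted(a), sorted(b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

print(count_common_sorted([3, 1, 4, 1, 5], [5, 9, 2, 6, 5, 3]))  # prints 2 (matches 3 and 5)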
17. Idea 2: Sorting
Analysis:
– Number of times the if statement is executed?
– Each iteration advances at least one pointer, so at most |dataset1| + |dataset2| iterations
Running time:
– At most 100M + 100M comparisons
– At 1B comparisons/second ~ 0.2 seconds
– Plus cost of sorting! ~1 second per list
– Total time = 2.2 seconds
18. Reasoning About Running Times (1)
Worry about the computation as a function of input size:
– “If I double my input size, how much longer will it take?”
• Linear time (comparisons after sorting): twice as long!
• Quadratic time (scan): four (2²) times as long
• Cubic time (very slow): eight (2³) times as long
• Exponential time (untenable): doubling the input squares the running time
• Sublinear time (uses sampling, skips over input)
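One way to see these growth rates empirically, a rough sketch (exact timings depend on the machine; only the ratios matter):

import time

def elapsed(f, n):
    start = time.time()
    f(n)
    return time.time() - start

def linear(n):
    sum(range(n))

def quadratic(n):
    sum(1 for i in range(n) for j in range(n))

for f, n in [(linear, 10**6), (quadratic, 10**3)]:
    ratio = elapsed(f, 2 * n) / elapsed(f, n)
    print(f.__name__, round(ratio, 1))  # roughly 2 for linear, 4 for quadratic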
19. Reasoning About Running Times (2)
Worry about the computation as a function of input size.
Worry about order of magnitude, not exact running time:
– Difference between 2 seconds and 4 seconds much smaller than
between 2 seconds and 3 months!
• The sorted-pointer algorithm does more work in its while loop (but only a constant
amount more) -- 3 comparisons instead of 1.
• Therefore, we still call it linear time
20. Reasoning about running time
Worry about the computation as a function of input size.
Worry about order of magnitude, not exact running time.
Captured by the Order notation: O(.)
– For an input of size n, approximately how long will it take?
– Scan: O(n²)
– Comparisons after sorting: O(n)
21. Reasoning about running time
Worry about the computation as a function of input size.
Worry about order of magnitude, not exact running time.
Captured by the Order notation: O(.)
– For an input of size n, approximately how long will it take?
– Scan: O(n²)
– Comparisons after sorting: O(n)
– Sorting: O(n log n)
• Slightly more than n,
• But much less than n².
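For concreteness, the magnitudes at n = 10⁸ (a quick check in Python):

import math
n = 10**8
print(n, n * math.log2(n), n**2)  # 1e8 vs ~2.7e9 vs 1e16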
22. Avoiding Sort: Hashing
Idea 3.
– Store each number in list1 in a location unique to it
– For each element in list2, check whether its unique location is occupied
[Blackboard]
23. Idea 3: Hashing
table = set()
for number in dataset1:
    table.add(number)

common_elements = 0
for number in dataset2:
    if number in table:
        common_elements += 1
Analysis:
– Number of additions to the table: |dataset1|
– Number of comparisons: |dataset2|
– If additions to the table and lookups both run at 1B/second
– Total running time is ~0.2s
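In practice, Python's built-in sets are hash tables, so the whole computation collapses to one line (note this counts distinct common values; duplicates collapse when the sets are built):

common_elements = len(set(dataset1) & set(dataset2))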
24. Lots of Details
Hashing, Sorting, Scanning:
– All have their advantages
– Scanning: in place, just passing through the data
– Sorting: in place (no extra storage), much faster
– Hashing: not in place, even faster
25. Lots of Details
Hashing, Sorting, Scanning:
– All have their advantages
– Scanning: in place, just passing through the data
– Sorting: in place (no extra storage), much faster
– Hashing: not in place, even faster
Reasoning about algorithms:
– Non-trivial (and hard!)
– A large part of computer science
– Luckily, mostly abstracted away
27. Distributed Computation
Working with large datasets:
– Most datasets are skewed
– A few keys are responsible for most of the data
– Must take skew into account, since averages are misleading
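A toy illustration of why averages mislead under skew (the numbers are made up):

counts = [10**6] + [1000] * 999      # one hot key plus 999 ordinary keys
average = sum(counts) / len(counts)  # 1999 records per key on average
print(average, max(counts))          # the hot key holds ~500x the average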
28. Additional Cost
Communication cost
– Prefer doing more on a single machine (even if that means extra work) to
communicating constantly
– Why? If you have 1000 machines talking to 1000 machines --- that’s
1M channels of communication
– The overall communication cost grows quadratically, which we have
seen does not scale...
30. Doing the study
Suppose you had the data available. What would you do?
If you have a hypothesis:
– “Taking both Drug A and Drug B causes a side effect C”?
31. Doing the study
If you have a hypothesis:
– “Taking both Drug A and Drug B causes a side effect C”?
Look at the ratio of observed symptoms over expected:
– Expected: fraction of people who took drug A and saw effect C.
– Observed: fraction of people who took drugs A and B and saw effect C.
[Venn diagram: sets of people taking drugs A and B, with effect C in the overlap]
32. Doing the study
If you have a hypothesis:
– “Taking both Drug A and Drug B causes a side effect C”?
Look at the ratio of observed symptoms over expected:
– Expected: fraction of people who took drug A and saw effect C.
– Observed: fraction of people who took drugs A and B and saw effect C.
This is just counting!
[Venn diagram: sets of people taking drugs A and B, with effect C in the overlap]
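A minimal counting sketch of that ratio (the record format and field names are assumptions for illustration, not from the slides):

# Each record: (set of drugs taken, set of effects seen), e.g. ({"A", "B"}, {"C"})
def observed_over_expected(records):
    # Assumes both groups below are non-empty.
    took_a = [effects for drugs, effects in records if "A" in drugs]
    took_ab = [effects for drugs, effects in records if {"A", "B"} <= drugs]
    expected = sum("C" in effects for effects in took_a) / len(took_a)
    observed = sum("C" in effects for effects in took_ab) / len(took_ab)
    return observed / expected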
33. Doing the study
Suppose you had the data available. What would you do?
Discovering hypotheses to test:
– Many pairs of drugs, some co-occur very often
– Some side effects are already known