ThinkFast: Scaling Machine Learning to Modern Demands


This document discusses scaling machine learning to meet modern demands for analyzing massive datasets. It notes initiatives like the Precision Medicine Initiative, which aims to sequence 1,000,000 genomes at 10-50 GB per person, and how statistical learning and structured regularization can help analyze such data. The document presents machine learning as a "query language" that can be optimized like database queries by exploiting the mathematical structure of learning problems and efficient feature storage. It lists applications in bioinformatics where these techniques improved on state-of-the-art results.


ThinkFast: Scaling Machine Learning to Modern Demands
Hristo Paskov

The Genomic Data Deluge
• Precision Medicine Initiative: sequence 1,000,000 genomes
– $215 Million in 2015
– Pilot study
– Outputs 10-50 GB/person
How do we analyze all of this data to drive progress?
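To put the figures on this slide in perspective, the short calculation below multiplies the genome count by the per-person output. It is only back-of-the-envelope arithmetic; the decimal convention (1 PB = 1,000,000 GB) and the names used are assumptions made for illustration.

```python
# Back-of-the-envelope total for the slide's figures.
# Assumption: decimal units, 1 PB = 1,000,000 GB.
GENOMES = 1_000_000
GB_PER_PERSON = (10, 50)  # low and high estimates from the slide


def total_petabytes(genomes: int, gb_per_person: float) -> float:
    """Total output in petabytes for a given per-person size in GB."""
    return genomes * gb_per_person / 1_000_000


low, high = (total_petabytes(GENOMES, gb) for gb in GB_PER_PERSON)
print(f"{low:.0f}-{high:.0f} PB")  # prints "10-50 PB"
```

Under these assumptions the full cohort would produce on the order of tens of petabytes.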

Massive Data Sources
News
eCommerce
Bioinformatics
100K Genomes
Social Media

The Analysis Refinement Cycle
Data
Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$
Solver: $x^+ = x - \alpha M \nabla f(x)$
Model captures data nuance?
Solver exists, is fast enough?
Yes? Proceed! No? Quit
Increase time, money, experience, resources
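The model and solver named on this slide can be combined in a few lines. The sketch below applies the update x+ = x − αM∇f(x) to the ridge objective, assuming M defaults to the identity and α to the reciprocal of the largest Hessian eigenvalue; both defaults, and all function names, are illustrative choices rather than anything specified in the talk.

```python
import numpy as np


def ridge_gradient(X, y, w, lam):
    """Gradient of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2 with respect to w."""
    return -X.T @ (y - X @ w) + lam * w


def preconditioned_gd(X, y, lam, M=None, alpha=None, iters=500):
    """Iterate w <- w - alpha * M @ grad f(w), starting from w = 0."""
    d = X.shape[1]
    M = np.eye(d) if M is None else M
    if alpha is None:
        # Safe step size when M is the identity:
        # 1 / (largest eigenvalue of the Hessian X^T X + lam * I).
        hessian = X.T @ X + lam * np.eye(d)
        alpha = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(d)
    for _ in range(iters):
        w = w - alpha * (M @ ridge_gradient(X, y, w, lam))
    return w


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)

H = X.T @ X + 1.0 * np.eye(5)
w_exact = np.linalg.solve(H, X.T @ y)          # closed-form ridge solution
w_gd = preconditioned_gd(X, y, lam=1.0)        # plain gradient descent
w_newton = preconditioned_gd(X, y, lam=1.0, M=np.linalg.inv(H),
                             alpha=1.0, iters=1)  # M = inverse Hessian
```

Choosing M as the inverse Hessian with α = 1 turns the update into a Newton step, which solves this quadratic problem in a single iteration; that is one way to see why the choice of M governs whether a solver is "fast enough".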

More Than Just Training Models
• Regularization paths
• Model risk assessment
• Interpretability
[Plot: model coefficient versus regularization parameter]
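A regularization path, the first item on this slide, is the set of fitted coefficients as the regularization parameter varies. As a minimal sketch, the ridge case can reuse a single SVD for every parameter value; the choice of ridge, the objective scaling, and the function name `ridge_path` are assumptions made for illustration.

```python
import numpy as np


def ridge_path(X, y, lambdas):
    """Ridge solutions for each lam, minimizing
    0.5*||y - Xw||^2 + 0.5*lam*||w||^2, from one shared SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # w(lam) = V diag(s / (s^2 + lam)) U^T y
    return np.array([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])


# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lambdas = [0.1, 1.0, 10.0, 100.0]
path = ridge_path(X, y, lambdas)          # one row of coefficients per lam
norms = np.linalg.norm(path, axis=1)      # shrinks as lam grows
```

Each row of `path` is one point on the plot of coefficients against the regularization parameter.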

Brief History of Statistical Learning
Interpretability & Statistical Guarantees
Scalability
Ease of Use
Simple Models
Kernel Methods
Trees & Ensembles
Structured Regularization

Structured Regularization
Losses
• Regression
• Classification
• Ranking
• Motif Finding
• Matrix Factorization
• Feature Embedding
• Data Imputation
• …
Regularizers
• Sparsity
• Spatial / Temporal / Manifold Structure
• Group Structure
• Hierarchical Structure
• Structured & Unstructured Multitask Learning
• …
$\min_{\beta \in \mathbb{R}^d} L(X\beta) + \lambda R(\beta)$
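One common way to handle the regularizer R in an objective of this form is through its proximal operator, which a proximal gradient method applies after each gradient step on the loss. The sketch below shows the operators for two regularizers from the list, sparsity (the ℓ1 norm) and group structure (a sum of group ℓ2 norms). Proximal methods are an assumption here, not necessarily the approach taken in the talk, and the function names are hypothetical.

```python
import numpy as np


def prox_l1(beta, t):
    """Proximal operator of t*||beta||_1: elementwise soft-thresholding."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)


def prox_group_l2(beta, t, groups):
    """Proximal operator of t * sum_g ||beta_g||_2 for non-overlapping groups.

    Each group is shrunk toward zero as a block and set to zero entirely
    when its norm is at most t.
    """
    out = np.zeros_like(beta)
    for idx in groups:
        norm = np.linalg.norm(beta[idx])
        if norm > t:
            out[idx] = (1.0 - t / norm) * beta[idx]
    return out


sparse = prox_l1(np.array([3.0, -0.5, 1.5]), 1.0)
grouped = prox_group_l2(np.array([3.0, 4.0, 0.1, 0.1]), 1.0, [[0, 1], [2, 3]])
```

In the first call the middle entry is zeroed individually; in the second, the weak group (indices 2 and 3) is removed as a whole while the strong group is only rescaled.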

The Lasso’s Combinatorial Side
$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$
[Plot: model coefficient versus $\lambda$]
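The combinatorial behaviour of the Lasso shows up in which coefficients are exactly zero at a given λ. The sketch below assumes the loss is half the squared error, L(u) = ½‖u‖², and uses a basic proximal gradient (ISTA) solver; both are illustrative assumptions, not the talk's solver. Under that loss, every coefficient is zero once λ reaches ‖Xᵀy‖∞.

```python
import numpy as np


def lambda_max(X, y):
    """Smallest lam for which beta = 0 minimizes
    0.5*||y - X beta||^2 + lam*||beta||_1."""
    return np.max(np.abs(X.T @ y))


def lasso_ista(X, y, lam, iters=2000):
    """Proximal gradient (ISTA) for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (spectral norm squared)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = beta + step * (X.T @ (y - X @ beta))                     # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta


# Synthetic data, for illustration only.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 8))
y = rng.standard_normal(40)
lam_top = lambda_max(X, y)

# Number of nonzero coefficients at a few points along the path.
support_sizes = {
    frac: int(np.count_nonzero(lasso_ista(X, y, frac * lam_top)))
    for frac in (1.01, 0.5, 0.1)
}
```

Sweeping λ downward from `lam_top` and recording the support at each value gives a discrete view of the path: coefficients enter (and can leave) the model at isolated values of λ.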

The Database Perspective
$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$
$-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\,\partial_\beta \|\beta\|_1$
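The second expression on this slide is the subdifferential of the objective. As a worked note, assuming L is convex and differentiable (an assumption not stated on the slide), setting it to contain zero gives the usual optimality conditions, which involve the data only through a product with $X^T$:

```latex
% Assumption: L is convex and differentiable; X_j denotes the j-th column of X.
\[
  0 \in -X^T \nabla L(y - X\beta) + \lambda\,\partial_\beta \|\beta\|_1,
  \qquad
  \bigl(\partial_\beta \|\beta\|_1\bigr)_j =
  \begin{cases}
    \{\operatorname{sign}(\beta_j)\} & \text{if } \beta_j \neq 0,\\
    [-1,\,1] & \text{if } \beta_j = 0.
  \end{cases}
\]
% Writing v = \nabla L(y - X\beta), this is equivalent to
\[
  X_j^T v = \lambda \operatorname{sign}(\beta_j) \ \text{ if } \beta_j \neq 0,
  \qquad
  \lvert X_j^T v \rvert \le \lambda \ \text{ if } \beta_j = 0.
\]
```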

The Database Perspective
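$-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\,\partial_\beta \|\beta\|_1$
Feature & label storage
Data access operations
$u = y - X\beta$
$v = \partial_u L(u)$
$w = X^T v$
ML “Query Language”
$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$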
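The three data access operations on this slide can be read as the only interface a solver needs to the stored features and labels. The sketch below illustrates that reading with a hypothetical `FeatureStore` class and a squared loss; the class, its method names, and the loss are assumptions made for illustration and do not describe the system in the talk.

```python
import numpy as np


class FeatureStore:
    """Hypothetical storage layer: holds X and y, exposes only two operations."""

    def __init__(self, X, y):
        self._X, self._y = X, y

    def residual(self, beta):
        """u = y - X beta."""
        return self._y - self._X @ beta

    def adjoint(self, v):
        """w = X^T v."""
        return self._X.T @ v


def squared_loss_grad(u):
    """v = dL/du for the assumed loss L(u) = 0.5 * ||u||^2."""
    return u


def smooth_gradient(store, loss_grad, beta):
    """Gradient of L(y - X beta) with respect to beta, built from u, v, w."""
    u = store.residual(beta)
    v = loss_grad(u)
    w = store.adjoint(v)
    return -w


# Synthetic data, for illustration only.
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
store = FeatureStore(X, y)
beta = rng.standard_normal(4)
grad = smooth_gradient(store, squared_loss_grad, beta)
```

Because the solver touches the data only through `residual` and `adjoint`, the storage layer is free to change how X is laid out (sparse, compressed, distributed) without changing the learning code, which is the database analogy the slide draws.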

The Database Perspective
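$\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \sum_{t=1}^{3} \left[ L_t(y_t - X_t\beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1\ \beta_2\ \beta_3] \right\|_*$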
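The last term of this objective couples the three tasks. Reading the trailing ∗ as the nuclear norm of the matrix whose columns are β1, β2, β3 (an interpretation of the slide's notation), the penalty is the sum of that matrix's singular values, so it favors coefficient vectors that share a low-dimensional subspace. The sketch below computes the norm and its proximal operator, singular value thresholding; the function names are hypothetical.

```python
import numpy as np


def nuclear_norm(B):
    """Sum of singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()


def singular_value_threshold(B, tau):
    """Proximal operator of tau * ||B||_*: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


# Three tasks with identical coefficient vectors: the stacked matrix has rank 1.
b = np.array([1.0, 2.0, 2.0])            # ||b||_2 = 3
B = np.column_stack([b, b, b])
norm_before = nuclear_norm(B)            # equals 3 * sqrt(3) for this B
B_shrunk = singular_value_threshold(B, 1.0)
norm_after = nuclear_norm(B_shrunk)      # reduced by tau = 1
```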

The Database Perspective
Feature, label and model storage
Data access operations
$u = y - X\beta$
$v = \partial_u L(u)$
$w = X^T v$
ML “Query Language”
$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$
[Diagram: models $M_1$, $M_2$, $M_3$]

The Database Perspective
$u = y - X\beta$
$v = \partial_u L(u)$
$w = X^T v$
$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$
[Diagram: models $M_1$, $M_2$, $M_3$]
Processing Memory
Mathematical Structure

“Query Language” Optimization
• Static analysis
$\|y - Xw\|_2^2 + \|w\|_2^2$
$\|y - Xw\|_2^2 + \|w\|_1$
?
$\|y - Xw\|_2^2 + \frac{1}{2}\|w\|_2^2 + \|w\|_1$

“Query Language” Optimization
• Static analysis
$\|y - Xw\|_2^2 + \|w\|_2^2$
$\|y - Xw\|_2^2 + \|w\|_1$
$\|y - Xw\|_2^2 + \frac{1}{2}\|w\|_2^2 + \|w\|_1$
?
$\varepsilon(y - Xw) + \frac{1}{2}\|w\|_2^2 + \|w\|_1$

“Query Language” Optimization
• Static analysis
• Runtime analysis
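As one illustration of what static analysis of an objective could look like, the sketch below uses a well-known rewrite: the squared-error objective with both a ½‖w‖₂² and a ‖w‖₁ penalty equals a pure ℓ1-penalized squared-error objective on augmented data, so an existing Lasso solver can be reused. This is a standard identity, not necessarily the analysis performed in the talk, and the function names are hypothetical.

```python
import numpy as np


def elastic_net_objective(X, y, w, l2=0.5, l1=1.0):
    """||y - Xw||_2^2 + l2*||w||_2^2 + l1*||w||_1."""
    return np.sum((y - X @ w) ** 2) + l2 * np.sum(w ** 2) + l1 * np.sum(np.abs(w))


def lasso_objective(X, y, w, l1=1.0):
    """||y - Xw||_2^2 + l1*||w||_1."""
    return np.sum((y - X @ w) ** 2) + l1 * np.sum(np.abs(w))


def rewrite_as_lasso(X, y, l2=0.5):
    """Fold the squared l2 penalty into the data: append sqrt(l2)*I to X
    and matching zeros to y."""
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(l2) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return X_aug, y_aug


# Synthetic data, for illustration only.
rng = np.random.default_rng(4)
X = rng.standard_normal((25, 6))
y = rng.standard_normal(25)
w = rng.standard_normal(6)

X_aug, y_aug = rewrite_as_lasso(X, y)
original = elastic_net_objective(X, y, w)
rewritten = lasso_objective(X_aug, y_aug, w)   # same value for every w
```

Since the two objectives agree for every w, they share the same minimizer, and the rewrite can be decided from the form of the objective alone, before any data is read.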

Some Bioinformatics Applications
• Personalized medicine, Memorial Sloan Kettering Cancer Center
– 35% accuracy improvement over state-of-the-art
• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
– Previously unsolved problem
• Toxicogenomic analysis, Stanford University
– Improved on state-of-the-art results

Upcoming
• Massive-scale character-level sentiment and text analysis on Amazon data
– Billions of features, hours to solve a model
– Efficient multitask learning
• Characterize the global limitations of learning word structure
– Devise provably more efficient regularizers for uncovering structure


Data Quality Analytics: Understanding what is in your data, before using it

Supporting innovation in insurance with randomized experimentation

Leveraging Data Science in the Automotive Industry

Summertime Analytics: Predicting E. coli and West Nile Virus

Reproducible Dashboards and other great things to do with Jupyter

GeoViz: A Canvas for Data Science

Managing Data Science | Lessons from the Field

Doing your first Kaggle (Python for Big Data sets)

Leveraged Analytics at Scale

How I Learned to Stop Worrying and Love Linked Data

Software Engineering for Data Scientists

Making Big Data Smart

Moving Data Science from an Event to A Program: Considerations in Creating Su...

Building Data Analytics pipelines in the cloud using serverless technology

Leveraging Open Source Automated Data Science Tools

Domino and AWS: collaborative analytics and model governance at financial ser...

The Role and Importance of Curiosity in Data Science


ThinkFast: Scaling Machine Learning to Modern Demands

1. ThinkFast: Scaling Machine Learning to Modern Demands Hristo Paskov

2. The Genomic Data Deluge • Precision Medicine Initiative: sequence 1,000,000 genomes – $215 Million in 2015 – Pilot study – Outputs 10-50 GB/person How do we analyze all of this data to drive progress?

3. Massive Data Sources News eCommerce Bioinformatics 100K Genomes Social Media

4. The Analysis Refinement Cycle ⨂ Data; Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$; Solver: $x^+ = x - \alpha M \nabla f(x)$. Model captures data nuance? Solver exists, is fast enough? Yes? Proceed! No? Quit. Increase time, money, experience, resources
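The model and solver on slide 4 can be sketched together in a few lines. This is a minimal illustration, not the speaker's implementation: the function names, the synthetic data, and the choice of $M$ as the inverse Hessian (which turns the update into a Newton step) are all assumptions made for the example.

```python
import numpy as np

def ridge_objective(X, y, w, lam):
    """f(w) = 1/2 ||y - Xw||_2^2 + lam/2 ||w||_2^2 (the model on slide 4)."""
    r = y - X @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

def ridge_gradient(X, y, w, lam):
    """Gradient of the ridge objective: X^T (Xw - y) + lam * w."""
    return X.T @ (X @ w - y) + lam * w

def preconditioned_step(w, grad, alpha, M):
    """One solver update x+ = x - alpha * M * grad f(x)."""
    return w - alpha * (M @ grad)

# Synthetic data (assumption: small dense problem, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.1

# Hypothetical preconditioner choice: M = inverse Hessian of the ridge objective.
H = X.T @ X + lam * np.eye(X.shape[1])
M = np.linalg.inv(H)

w0 = np.zeros(X.shape[1])
w1 = preconditioned_step(w0, ridge_gradient(X, y, w0, lam), alpha=1.0, M=M)

# Reference solution from the normal equations.
w_closed = np.linalg.solve(H, X.T @ y)
```

With this particular $M$ and $\alpha = 1$, a single step from zero lands on the ridge solution, because the objective is quadratic. With $M = I$ the same function performs plain gradient descent and needs many steps.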

5. More Than Just Training Models • Regularization paths • Model risk assessment • Interpretability [Figure: model coefficient vs. regularization parameter]

6. Brief History of Statistical Learning Interpretability & Statistical Guarantees Scalability Ease of Use Simple Models Kernel Methods Trees & Ensembles Structured Regularization

7. Structured Regularization. Losses: Regression, Classification, Ranking, Motif Finding, Matrix Factorization, Feature Embedding, Data Imputation, … Regularizers: Sparsity, Spatial / Temporal / Manifold Structure, Group Structure, Hierarchical Structure, Structured & Unstructured Multitask Learning, … $\min_{\beta \in \mathbb{R}^d} L(X\beta) + \lambda R(\beta)$

8. The Lasso’s Combinatorial Side $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$ [Figure: model coefficient vs. $\lambda$]
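The combinatorial behaviour of the Lasso on slide 8 (coefficients switching between zero and nonzero as $\lambda$ changes) can be observed with a small proximal-gradient solver. This is a sketch under stated assumptions: $L$ is taken to be the squared loss $\frac{1}{2}\|\cdot\|_2^2$, the solver is plain ISTA rather than whatever ThinkFast uses, and all names and data are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_objective(X, y, beta, lam):
    """1/2 ||y - X beta||_2^2 + lam ||beta||_1 (squared loss is an assumption)."""
    r = y - X @ beta
    return 0.5 * r @ r + lam * np.abs(beta).sum()

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) with constant step 1 / ||X||_2^2."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta)  # gradient of the smooth loss term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
y = X[:, :2] @ np.array([3.0, -2.0]) + 0.1 * rng.standard_normal(40)

# For the squared loss, every coefficient is zero once lam >= ||X^T y||_inf.
lam_max = np.abs(X.T @ y).max()
path = {lam: lasso_ista(X, y, lam) for lam in (lam_max, 0.5 * lam_max, 0.01 * lam_max)}
for lam, beta in path.items():
    print(f"lambda={lam:8.3f}  nonzeros={np.count_nonzero(beta)}")
```

Sweeping $\lambda$ from `lam_max` downward traces a regularization path: the support of $\beta$ changes at discrete values of $\lambda$, which is the combinatorial structure the slide's plot illustrates.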

9. The Database Perspective $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$; $-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\, \partial_\beta \|\beta\|_1$

10. The Database Perspective $-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\, \partial_\beta \|\beta\|_1$. Feature & label storage

11. The Database Perspective $-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\, \partial_\beta \|\beta\|_1$. Feature & label storage. Data access operations: $u = y - X\beta$, $v = \partial_u L(u)$, $w = X^T v$

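12. The Database Perspective $-X^T \partial_{y - X\beta} L(y - X\beta) + \lambda\, \partial_\beta \|\beta\|_1$. Feature & label storage. Data access operations: $u = y - X\beta$, $v = \partial_u L(u)$, $w = X^T v$. ML “Query Language”: $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$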
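The three data access operations on slides 11 and 12 suggest that a solver only needs to touch the stored features through two primitives: a residual $u = y - X\beta$ and a transposed product $w = X^T v$. The sketch below illustrates that separation. The `FeatureStore` class and its method names are hypothetical, invented for this example, and the squared loss is an assumed choice of $L$.

```python
import numpy as np

class FeatureStore:
    """Hypothetical storage layer: owns X and y, exposes only two access operations."""

    def __init__(self, X, y):
        self._X = X
        self._y = y

    def residual(self, beta):
        """u = y - X beta."""
        return self._y - self._X @ beta

    def correlate(self, v):
        """w = X^T v."""
        return self._X.T @ v

def squared_loss_grad(u):
    """v = dL/du for L(u) = 1/2 ||u||_2^2 (assumed loss)."""
    return u

def smooth_gradient(store, beta, loss_grad):
    """Gradient of L(y - X beta) with respect to beta, i.e. -X^T v."""
    u = store.residual(beta)   # data access 1
    v = loss_grad(u)           # pure math, no data access
    w = store.correlate(v)     # data access 2
    return -w

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
beta = rng.standard_normal(4)

store = FeatureStore(X, y)
g = smooth_gradient(store, beta, squared_loss_grad)
```

Because the solver never indexes `X` directly, the storage layer is free to change its internal representation (sparse, compressed, out-of-core) without touching the optimization code, which is one way to read the "database" analogy.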

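13. The Database Perspective $\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \sum_{t=1}^{3} \left[ L_t(y_t - X_t \beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1\ \beta_2\ \beta_3] \right\|_*$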
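The multitask objective on slide 13 couples three per-task problems through a nuclear-norm penalty on the matrix whose columns are $\beta_1, \beta_2, \beta_3$. A minimal sketch of evaluating it follows, assuming squared losses for every $L_t$ and $\ell_1$ penalties for every $R_t$. Those choices, the function name, and the data are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def multitask_objective(Xs, ys, B, lams, omega):
    """Sum of per-task objectives plus omega * nuclear norm of B = [beta_1 ... beta_T].

    Assumes L_t(u) = 1/2 ||u||_2^2 and R_t(beta) = ||beta||_1 for every task.
    """
    total = 0.0
    for t, (X, y) in enumerate(zip(Xs, ys)):
        beta = B[:, t]
        r = y - X @ beta
        total += 0.5 * r @ r + lams[t] * np.abs(beta).sum()
    # The nuclear norm (sum of singular values) rewards low-rank B,
    # i.e. tasks whose coefficient vectors share a common subspace.
    return total + omega * np.linalg.norm(B, ord="nuc")

rng = np.random.default_rng(0)
d, T = 5, 3
Xs = [rng.standard_normal((20, d)) for _ in range(T)]
ys = [rng.standard_normal(20) for _ in range(T)]
lams = [0.1, 0.2, 0.3]

# Rank-one example: all three tasks share the same coefficient vector.
beta = rng.standard_normal(d)
B_shared = np.column_stack([beta, beta, beta])

uncoupled = multitask_objective(Xs, ys, B_shared, lams, omega=0.0)
coupled = multitask_objective(Xs, ys, B_shared, lams, omega=1.0)
```

For a rank-one matrix with three identical columns the nuclear norm equals $\sqrt{3}\,\|\beta\|_2$, so the coupling term is cheap when tasks agree and grows as their coefficient vectors spread across more directions.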

14. The Database Perspective. Feature, label and model storage. Data access operations: $u = y - X\beta$, $v = \partial_u L(u)$, $w = X^T v$. ML “Query Language”: $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$ [Diagram: stored models $M_1$, $M_2$, $M_3$]

15. The Database Perspective $u = y - X\beta$, $v = \partial_u L(u)$, $w = X^T v$; $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$ [Diagram: stored models $M_1$, $M_2$, $M_3$] Processing, Memory, Mathematical Structure

16. Efficient Feature Storage

17. “Query Language” Optimization • Static analysis: $\|y - Xw\|_2^2 + \|w\|_2^2$; $\|y - Xw\|_2^2 + \|w\|_1$; ? $\|y - Xw\|_2^2 + \frac{1}{2}\|w\|_2^2 + \|w\|_1$

18. “Query Language” Optimization • Static analysis: $\|y - Xw\|_2^2 + \|w\|_2^2$; $\|y - Xw\|_2^2 + \|w\|_1$; $\|y - Xw\|_2^2 + \frac{1}{2}\|w\|_2^2 + \|w\|_1$; ? $\varepsilon(y - Xw) + \frac{1}{2}\|w\|_2^2 + \|w\|_1$

19. “Query Language” Optimization • Static analysis • Runtime analysis

20. Some Bioinformatics Applications • Personalized medicine, Memorial Sloan Kettering Cancer Center – 35% accuracy improvement over state-of-the-art • Metagenomic binning and DNA quality assessment, Stanford School of Medicine – Previously unsolved problem • Toxicogenomic analysis, Stanford University – Improved on state-of-the-art results

21. Upcoming • Massive scale character level sentiment and text analysis on Amazon data – Billions of features, hours to solve a model – Efficient multitask learning • Characterize the global limitations of learning word structure – Devise provably more efficient regularizers for uncovering structure

Editor's Notes

[Tons of data, show graph?] [Models are not good] [How do we quickly iterate with different models?] [Memory $$$]
