WhizzML is a domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. WhizzML offers out-of-the-box scalability, abstracts away the complexity of the underlying infrastructure, and helps analysts, developers, and scientists reduce the burden of repetitive and time-consuming analytics tasks.
2. Outline
1 What is WhizzML?
2 WhizzML Server-side Resources
3 WhizzML Language Basics
4 Standard Library Overview
5 Tutorial Walkthrough: Model or Ensemble?
4. WhizzML in a Nutshell
• Domain-specific language for ML workflow automation
  - High-level problem and solution specification
• Framework for scalable, remote execution of ML workflows
  - Sophisticated server-side optimization
  - Out-of-the-box scalability
  - Client-server brittleness removed
• Infrastructure for creating and sharing ML scripts and libraries (see the sketch below)
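As a taste of that high-level style, here is a minimal sketch of a complete workflow that builds a dataset and a model from an existing source (the source id is a hypothetical placeholder; create-and-wait-dataset and create-model are both introduced later in this deck):
;; Minimal workflow sketch: source -> dataset -> model
;; ("source/..." below is a hypothetical placeholder id)
(define ds-id (create-and-wait-dataset {"source" "source/570861ecb85eee0472000000"}))
(define model-id (create-model {"dataset" ds-id}))
(log-info "Created model " model-id)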
6. WhizzML REST Resources
Library: A reusable building block, i.e. a collection of WhizzML definitions that can be imported by other libraries or scripts.
Script: Executable code that describes an actual workflow.
  • Imports: List of libraries with code used by the script.
  • Inputs: List of input values that parameterize the workflow.
  • Outputs: List of values computed by the script and returned to the user.
Execution: Given a script and a complete set of inputs, the workflow can be executed and its outputs generated (see the sketch below).
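A hedged sketch of how these pieces fit together, borrowing the declaration style from the final slide of this deck (the input and output names here are hypothetical, and the Inputs/Outputs lists are metadata attached to the script resource rather than part of the WhizzML source itself):
;; Script: build a dataset from a user-supplied source
;; - Inputs:  [{"name": "src-id", "type": "source-id"}]
;; - Outputs: [{"name": "ds-id", "type": "dataset-id"}]
(define ds-id (create-and-wait-dataset {"source" src-id}))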
8. Basic Syntax
Atomic constants
"a string value"
23, -10, -1.23E11, 1.42342
true, false
Fully parenthesized prefix notation
(list-sources) ;; Function call without arguments
(log-info "Hello World!")
(* 2 (+ 2 3)) ;; Evaluates to 2 * (2 + 3)
(atan (tan 3)) ;; Nested function calls
12. Functions
Defining a function
(define (function-name arg1 arg2 ...)
body)
Examples
(define (add-numbers x y)
(+ x y))
(define (create-model-and-ensemble dataset-id)
(create-model {"dataset" dataset-id})
(create-ensemble {"dataset" dataset-id
"number_of_models" 10}))
13. Local variables
Let bindings
(let (name-1 val-1
name-2 val-2
...)
body)
Example:
(define no-of-models 10)
(let (msg "I am creating "
id "dataset/570861ecb85eee0472000016")
;; here msg, id and no-of-models are bound
(log-info msg no-of-models)
(create-ensemble {"dataset" id
"number_of_models" no-of-models}))
;;; here msg and id are *not* bound
14. Conditionals
if
(if (> x 0) ;; condition
"x is positive" ;; consequent
"x is not positive") ;; alternative
when
(when (positive? n)
(log-info "Creating a few models...")
(create-lots-of-models n))
15. Conditionals
cond
;; Nested conditionals
(if (> x 3)
"big"
(if (< x 1)
"small"
"standard"))
;; are better with cond:
(cond (> x 3) "big"
(< x 1) "small"
"standard")
16. Error handling
Signaling errors
(raise {"message" "Division by zero" "code" -10})
Catching errors
(try (/ 42 x)
(catch e
(log-warn "I've got an error with message: "
(get e "message")
" and code "
(get e "code"))))
17. Demo: a simple script
Create a dataset and return its row count
(define (make-dataset id name)
(let (ds-id (create-and-wait-dataset {"source" id
"name" name}))
(fetch ds-id)))
(define dataset (make-dataset source-id source-name))
(define dataset-id (get dataset "resource"))
(define rows (get dataset "rows"))
https://gist.github.com/whizzmler/917a05cf6c173381116e3cc02da70e42
19. Standard functions
• Numeric and relational operators (+, *, <, =, ...)
• Mathematical functions (cos, sinh, floor ...)
• Strings and regular expressions (str, matches?, replace, ...)
• Flatline generation
• Collections: list traversal, sorting, map manipulation
• BigML resources manipulation
  - Creation: create-source, create-and-wait-dataset, etc.
  - Retrieval: fetch, list-anomalies, etc.
  - Update: update
  - Deletion: delete
• Machine Learning Algorithms (SMACDown, Boosting, etc.)
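A short, hedged example combining a few of these functions (everything used here is either listed above or demonstrated elsewhere in this deck; the dataset id is the example id from the Let bindings slide):
;; Numeric and string helpers from the standard library
(define radius 2.5)
(define area (* radius radius 3.14159))
(log-info (str "floor(area) = " (floor area)))
;; Resource helpers: fetch a resource map and read a field with get
(define ds-name (get (fetch "dataset/570861ecb85eee0472000016") "name"))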
21. Model or Ensemble?
• Split a dataset into training and test parts
• Create a model and an ensemble with the training dataset
• Evaluate both with the test dataset
• Choose the one with the better evaluation (f-measure)
https://github.com/whizzml/examples/tree/master/model-or-ensemble
22. Model or Ensemble?
;; Functions for creating the two dataset parts
;; and the model and ensemble from the training set.
(define (sample-dataset ds-id rate oob)
(create-and-wait-dataset {"sample_rate" rate
"origin_dataset" ds-id
"out_of_bag" oob
"seed" "whizzml-example"}))
(define (split-dataset ds-id rate)
(list (sample-dataset ds-id rate false)
(sample-dataset ds-id rate true)))
(define (make-model ds-id)
(create-and-wait-model {"dataset" ds-id}))
(define (make-ensemble ds-id size)
(create-and-wait-ensemble {"dataset" ds-id
"number_of_models" size}))
23. Model or Ensemble?
;; Functions for evaluating model and ensemble
;; using the test set, and to extract f-measure from
;; the evaluation results
(define (evaluate-model model-id ds-id)
(create-and-wait-evaluation {"model" model-id
"dataset" ds-id}))
(define (evaluate-ensemble model-id ds-id)
(create-and-wait-evaluation {"ensemble" model-id
"dataset" ds-id}))
(define (f-measure ev-id)
(get-in (fetch ev-id) ["result" "model" "average_f_measure"]))
24. Model or Ensemble?
;; Function encapsulating the full workflow
(define (model-or-ensemble src-id)
(let (ds-id (create-and-wait-dataset {"source" src-id})
;; ^ full dataset
ids (split-dataset ds-id 0.8) ;; split it 80/20
train-id (nth ids 0) ;; the 80% for training
test-id (nth ids 1) ;; and 20% for evaluations
m-id (make-model train-id) ;; create a model
e-id (make-ensemble train-id 15) ;; and an ensemble
m-f (f-measure (evaluate-model m-id test-id)) ;; evaluate
e-f (f-measure (evaluate-ensemble e-id test-id)))
(log-info "model f " m-f " / ensemble f " e-f)
(if (> m-f e-f) m-id e-id)))
;; Compute the result of the script execution
;; - Inputs: [{"name": "input-source-id", "type": "source-id"}]
;; - Outputs: [{"name": "result", "type": "resource-id"}]
(define result (model-or-ensemble input-source-id))