The document provides guidance on leveling up a company's data infrastructure and analytics capabilities. It recommends starting by acquiring and storing data from various sources in a data warehouse. The data should then be transformed into a usable shape before performing analytics. When setting up the infrastructure, the document emphasizes collecting user requirements, designing the data warehouse around key data aspects, and choosing technology that supports iteration, extensibility and prevents data loss. It also provides tips for creating effective dashboards and exploratory analysis. Examples of implementing this approach for two sample companies, MESI and SalesGenomics, are discussed.
Having programmers do data science is terrible, if only everyone else were not even worse. The problem is of course tools. We seem to have settled on either: a bunch of disparate libraries thrown into a more or less agnostic IDE, or some point-and-click wonder which no matter how glossy, never seems to truly fit our domain once we get down to it. The dual lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way.
This presentation is a meditation on how I approach data problems with Clojure, what I believe the process of doing data science should look like and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
Clojure is fantastic for data manipulation and rapid prototyping, but falls short when it comes to communicating your insights. What is lacking are good visualization libraries and (shareable) notebook-like environments. I'll show my workflow in org-babel which weaves Clojure with R (for ggplot) and Python (for scikit-learn) and tell you why it's wrong, how IPythons of the world have trapped us in a local maximum and how we need a reconceptualization similar to what a REPL does to programming. All this interposed with my experience doing data science with Clojure (everything from ETL to on-the-spot analysis during a brainstorming).
Beyond Data Discovery: The Value Unlocked by Modern Data Modeling - Looker
In this webinar we will discuss Looker’s novel approach to data modeling and how it powers a data exploration environment with unprecedented depth and agility.
Some topics we will cover:
-A new architecture beyond direct connect
-Language-based, git-integrated data modeling
-Abstractions that make SQL more powerful and more efficient
The webinar also covers the current data landscape, key trends, and the future of business intelligence and data analytics.
Software Analytics for Pragmatists [DevOps Camp 2017] - Markus Harrer
Talk at DevOps Camp 2017, Nürnberg, 13.05.2017
Each step in the development or use of software leaves valuable, digital tracks. The analysis of this "software data" (such as runtime measures, log files or commits) refines our gut feeling into facts with sound evidence.
I'll show how questions that arise in software development can be answered in an automated, data-driven and reproducible way. I demonstrate the interaction of open source analysis tools (such as jQAssistant, Neo4j, Pandas, and Jupyter) for the analysis of data from different sources (such as JProfiler, Jenkins, and Git). Together, we have a look at how we can develop solutions to optimize performance, identify build breakers or make knowledge gaps in our source code visible.
Applied Data Science Course Part 1: Concepts & your first ML model - Dataiku
In this first course of our Applied Data Science online course series, you'll learn about the mindset shift of going from small to big data, basic definitions and concepts, and an overview of the data science workflow.
Dataiku productive application to production - pap is may 2015 Dataiku
Beyond Predictive Analytics: Deploying apps to production and keeping them improving
Some smart companies have been putting predictive applications in production for decades. Still, either because of lack of sharing or lack of generality, there is still no single and obvious way to put a predictive application in production today.
As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.
Behind the single word "production" lies a great number of questions: what exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after first development or should I lay half of them off? Etc.
Let's make a small analogy with the development of websites in the 90s and early 00s:
Back then, the winners were not necessarily the websites with an amazing design, but a winner had clearly made the necessary efforts and had a robust way to put their website reliably in production.
Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap… and so we forget. How much time before we get the same comfort for the predictive world?
Frank Bien, CEO of Looker, along with Amazon, Google and other data disruptors, discusses how innovators are deeply integrating analytics into every aspect of their businesses, from mobile to warehouse to cloud.
Frank shares Looker’s vision for the future of business intelligence and data analytics and reveals pivotal product and partnership updates.
PASS Summit: Data Storytelling with R, Power BI and Azure ML - Jen Stirrup
How can we use technology to help the organization make data-driven decision-making part of its organizational DNA, while retaining the context of the business as a whole? How can we imprint data in the culture of the organization and make it easily accessible to everyone? Microsoft directly empowers businesses to derive insights and value from little and big data, through its release of user-friendly analytics through Azure Machine Learning (ML) combined with its acquisition of Revolution Analytics. Power BI can be used to create compelling visual stories around the analysis so that the work is not left to the data consumer. Together, these technologies can be used to make data and analytics part of the organization's DNA.
There are no prerequisites, but attendees are welcome to follow along with the demo if they have an Azure ML and Power BI account and R installed. Files will be released before the session.
What are actionable insights? (Introduction to Operational Analytics Software) - Newton Day Uploads
What Are Actionable Insights? In this presentation I outline what Actionable Insights are and the Operational Analytics Software that can produce them. And because Business Intelligence and the Business Intelligence Software market can be so confusing for buyers I've attempted to position where Actionable Insights and Operational Analytics fit in the Business Intelligence 'story'.
Producing direct value for businesses via quantitative models.
New analytical tools such as Looker allow data analysts to speed up the dirty work around building data models—making it less painful to clean data, explore predictive factors, and evaluate results.
In this educational webinar from Data Science Central (DSC), Justin Palmer of LendingHome, a mortgage banking and marketing platform, joins Colin Zima, Chief Analytics Officer at Looker. Using a public-domain FAA dataset and the LendingHome platform as examples, they dig into the data modeling process and offer ideas for improvements.
- See more at: http://try.looker.com/resources/improving-data-modeling-workflow#sthash.2rGxwhJ7.dpuf
Presented by BrainSell, a top Sage ERP partner, www.brainsell.net
Sage Intelligence works with Sage 100, Sage 300, Sage 500 and more. See how it can make your life better today!
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Learning - Formulatedby
Presented by Yashas Vaidya, Senior Data Scientist at Dataiku
Next DSS MIA Event - https://datascience.salon/miami/
The talk covers the steps to taking a machine learning model to production, modern architectures and technologies for building production machine learning, and an overview of the talent and processes for creating and maintaining it.
Synapse is a solution provider with an innovative alternative to commercial off-the-shelf IT applications, empowering business professionals to shape business processes without being chained to IT applications.
When it comes to creating an enterprise AI strategy: if your company isn’t good at analytics, it’s not ready for AI. Succeeding in AI requires being good at data engineering AND analytics. Unfortunately, management teams often assume they can leapfrog best practices for basic data analytics by directly adopting advanced technologies such as ML/AI – setting themselves up for failure from the get-go. This presentation explains how to get basic data engineering and the right technology in place to create and maintain data pipelines so that you can solve problems with AI successfully.
When and Where to Embed Business Intelligence - Looker
Watch the recorded webinar at http://bit.ly/1MeX7QK
Everywhere you look, companies are using external-facing analytics to maximize the value derived from their data assets, by moving customers up the value chain, increasing stickiness, and offering a more competitive product on the marketplace.
Listen to learn about how to bring an external-facing data product to market by embedding BI software, and what that can add to your offering.
Presentation covers:
-Top use cases for embedding business intelligence software
-Case studies from different companies currently embedding BI
-Build vs buy considerations
-Evaluating ROI
Stop refreshing vanity metrics & start focusing on the metrics that inform decisions - Looker
Stop Refreshing Vanity Metrics & Start Focusing on the Metrics that Inform Decisions
There is a propensity to focus on vanity metrics: metrics that show you the score, like how many new views, new daily active users, or how much revenue last week. You may slice these by different attributes (geography, platform, user demographics). While this can help you understand the high-level trends in your business, it does little to tell you how to get better.
This slide deck looks at how vanity metrics can distract you from focusing on the analysis that matters, which is identifying and measuring the metrics that drive decisions. There are several real examples of how companies (Venmo, Simply Business, and Looker) have used event data in highly customized ways to make better decisions about their products.
Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data, often in real time. However, for the analyst or data scientist who builds offline models, integrating their analyses into these pipelines for operational purposes can pose a challenge.
In this slide deck, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating statistical and machine learning solutions: from collection and storage to analysis and real-time predictions.
This presentation has been uploaded by Public Relations Cell, IIM Rohtak to help the B-school aspirants crack their interview by gaining basic knowledge on IT.
ADV Slides: Comparing the Enterprise Analytic Solutions - DATAVERSITY
Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you’re more than halfway to success. That’s why leverageable (i.e., multiple use) artifacts of the enterprise data environment are so critical to enterprise success.
Build them once (keep them updated), and use again many, many times for many and diverse ends. The data warehouse remains focused strongly on this goal. And that may be why, nearly 40 years after the first database was labeled a “data warehouse,” analytic database products still target the data warehouse.
New usage model for real-time analytics by Dr. William L. Bain at Big Data Spain - Big Data Spain
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
No doubt Visualization of Data is a key component of our industry. The path data travels from the moment it is created until it takes shape in a chart is sometimes obscure and overlooked as it tends to live in the engineering side (when volume is relevant), an area Data Scientists tend to visit but not the usual Web/Marketing Data Analyst. Nowadays the options to tame all that journey and make the best of it are many and they don't require extensive engineering knowledge. Small or Big Data, let's see what "Store, Extract, Transform, Load, Visualize" is all about.
Analyzing Billions of Data Rows with Alteryx, Amazon Redshift, and Tableau - DATAVERSITY
Got lots of data? So does Amaysim, a leading Australian telecom provider, with its billions of rows of data. The organization successfully empowers its small team of data analysts with self-service data analytics platforms so they can easily access the data they need, perform advanced analytics, and visualize findings for all stakeholders. Register for this session and learn how Amaysim uses the Alteryx-Redshift-Tableau BI stack to easily and quickly:
Extract data from their data warehouse and blend and enrich it with other sources
Give data analytical context by running statistical, predictive, and deep geo-spatial analytics
Create visualizations from analytics and then update Tableau Workbooks directly from Alteryx, or publish the results in Amazon Redshift, for easy direct access for their stakeholders from Tableau
Hear from Adrian Loong, Alteryx Analytics Certified Expert (ACE), and product marketers from AWS and Alteryx on how organizations can use Alteryx, Amazon Redshift and Tableau to enable data analysts to spin up new self-service analytics instances to enable fast investigation for critical business decisions.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Types of database processing: OLTP vs. data warehouses (OLAP). Characteristics of a data warehouse: subject-oriented, integrated, time-variant, non-volatile. Functionalities of a data warehouse: roll-up (consolidation), drill-down, slicing, dicing, pivot. The KDD process and applications of data mining.
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.
Similar to Levelling up your data infrastructure (20)
Recommendation algorithms and their variations such as ranking are the most common way for machine learning to find its way into a product where it is not the main focus. In this talk we’ll dig into the subtleties of making recommendation algorithms a seamless and integral part of your UX (goal: it should completely fade into the background. The user should not be aware she’s interacting with any kind of machine learning, it should just feel right, perhaps smart or even a tad like cheating); how to solve the cold start problem (and having little training data in general); and how to effectively collect feedback data. I’ll be drawing from my experiences building Metabase, an open source analytics/BI tool, where we extensively use recommendations and ranking to keep users in a state of flow when exploring data; to help with discoverability; and as a way to gently teach analysis and visualization best practices; all on the way towards building an AI data scientist.
In this talk we will look at how to efficiently (in both space and time) summarize large, potentially unbounded, streams of data by approximating the underlying distribution using so-called sketch algorithms. The main approach we are going to be looking at is summarization via histograms. Histograms have a number of desirable properties: they work well in an on-line setting, are embarrassingly parallel, and are space-bound. Not to mention they capture the entire (empirical) distribution which is something that otherwise often gets lost when doing descriptive statistics. Building from that we will delve into related problems of sampling in a stream setting, and updating in a batch setting; and highlight some cool tricks such as capturing time-dynamics via data snapshotting. To finish off we will touch upon algorithms to summarize categorical data, most notably count-min sketch.
Transducers -- composable algorithmic transformations decoupled from input or output sources -- are Clojure’s take on data transformation. In this talk we will look at what makes a transducer; push their composability to the limit chasing the panacea of building complex single-pass transformations out of reusable components (e.g. calculating a bunch of descriptive statistics like sum, sum of squares, mean, variance, ... in a single pass without resorting to a spaghetti ball fold); explore how the fact that they are decoupled from input and output traversal opens up some interesting possibilities, as they can be made to work in both online and batch settings; all drawing from practical examples of using Clojure to analyze “awkward-size” data.
You have defined your metrics, set up dashboards, and started to incorporate data into your everyday. Great, but I have some bad news for you. Almost certainly some of your metrics are wrong. At best these mistakes mean that you are not getting all the insights you could have, at worst some of the conclusions you have drawn from them are wrong. In this talk we will go through the most common but pernicious mistakes and unravel the mechanisms behind them so that by the end of the talk you will be equipped with an analytical toolset to spot them on your own. The main classes of errors we will cover are: viewing data as a static process; not considering error margins and variance; picking the wrong reference point; assuming your population is homogeneous; and improperly accounting for costs.
Writing correct smart contract is hard (a recent study estimated that 3% of Ethereum contracts in the wild have some sort of security vulnerability; we all know of the DAO and Parity exploits, …). There are two main reasons for this. First and foremost developing for the blockchain is quite different than what most programmers are used to. The level of concurrency is far beyond our (von Neumann) intuition and mental models. And you can’t stop and inspect running code like you can in other systems. Taken together blockchain is closer to a physical/living system than conventional software — a fact not reflected in the tools available. Compared to other domains our tooling and programming languages are somewhere between rudimentary and bad; and a far cry from where they would need to be to augment developers and help make programming for the blockchain less alien and less error prone. In this talk we will first unpack what makes programming for the blockchain hard, and what are the most common types of vulnerabilities and their causes. Then we will look at the state of art programming language research in correctness proving and programming massively concurrent systems; and how these can be applied to programming smart contracts; revisit some technologies from the past that didn’t get traction at the time, but are nevertheless worth studying; and finishing off by trying to imagine how programming for the blockchain should, and perhaps one day will, look like.
Online statistical analysis using transducers and sketch algorithms - Simon Belak
Online statistical analysis using transducers and sketch algorithms. Don’t know what either is? You are going to learn something very cool (and perspective-changing) then. Know them, but want an experience report? Got you covered, fam.
OpenAI recently published a fun paper where they showed using evolution algorithms to train policy networks to perform on par with state of the art reinforcement deep learning. In this talk we’ll try to reimplement the main ideas in that paper using Neanderthal (blazing fast matrix and linear algebra computations) and Cortex (neural networks); make it massively distributed using Onyx; build a simulation environment using re-frame; and of course save our princess from no particular harm in our toy game example
How to systematically open a new market where every step is supported by data, how to set up learning loops, and where to look for optimization opportunities.
You can do cool and unexpected things if your entire type system is a first class citizen and accessible at runtime.
With the introduction of spec, Clojure got its own distinct spin on a type system. Just as macros add another -time (runtime and compile time) where the full power of the language can be used, spec does the same for describing data.
The result is an entire additional type system that is a first class citizen and accessible at runtime that facilitates validation, generative testing (a la QuickCheck), destructuring (pattern matching into deeply nested data), data macros (recursive transformations of data) and a pluginable error system. And then you can start building on top of it.
The talk will be half introduction to spec and the ideas packed within it, and half experience report instrumenting 15k loc production codebase (primarily ETL and analytics) with spec.
Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a streaming data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; the inferences and automations that can be built on top of that; and how and why Clojure is a natural choice for tasks that involve a lot of data manipulation, touching both on functional programming and lisp-specifics such as code-is-data.
We will look at how such an approach can be used to manage a data warehouse by automatically inferring materialized views from raw incoming data or other views based on a combination of heuristics, statistical analysis (seasonality, outlier removal, ...) and predefined ontologies. Doing so is a practical way to maintain a large number of views, increasing their availability and abstracting the complexity into declarative rules, rather than having an ETL pipeline with dozens or even hundreds of hand crafted tasks.
The system described requires relatively little effort upfront but can easily grow with one's needs both in terms of scale as well as scope. With its good introspection capabilities and strong decoupling it is for instance an excellent substrate for putting machine learning algorithms in production, which is the final use-case we will dive into.
Segmentation is key to effectively addressing and converting potential customers. Simon Belak, head of analytics at GoOpti and transmedia editor at the critical newspaper Tribuna, revealed how to discover segments from data.
In his words, it is entirely unjustified that segmentation is mostly static and done blindly, without regard for the data. In the talk he presented an alternative: analytical, partially automated discovery of segments from data.
Using concrete examples he showed how to map data about customer interactions (page visits as indicators of interest, survey answers, on-site movement patterns, email opens…) into a customer model, and then continued by splitting it into segments. Simon concluded by pointing out the most common pitfalls and small tricks for cases where we have little data or the data is unclear.
@sbelak
Simon Belak
Using Onyx in anger
Clojure has always been good at manipulating data. With the release of spec and Onyx ("masterless, cloud scale, fault tolerant, high performance distributed computation system") good became best. In this talk I will walk you through a data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; and the inferences and automations that can be built on top of that.
Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; and the inferences and automations that can be built on top of that.
Whenever a programming language comes out with a new feature, us smug lisp weenies shrug and point out how lisp had that in the early seventies; and if you look at the list of influences of a given language, there is bound to be a lisp in there. In this talk I will try to unpack what makes lisp special, why it is called the programmable programming language, how it changes one’s thinking, and how that thinking can be applied elsewhere.
Successfully forecasting future demand is key in allowing GoOpti its low prices while isolating transport partners from risk. In this talk Simon Belak, Chief Data Scientist at GoOpti, will take you through how he approaches forecasting and the lessons he learned along the way. The focus is going to be on models that do not require excessive amounts of data, are legible and work well as part of a continuous process (rather than being a one-off problem).
In this talk, you will discover how a 15k LOC codebase was instrumented with spec so you don't have to (but probably should). Validation; testing; destructuring; composable “data macros” via conformers; we’ve tried spec in all its multifaceted glory. You will discover a distillation of lessons learned interspersed with musing on how spec alters development flow and one’s thinking.
We instrumented 15k LOC codebase with spec so you don't have to (but probably should). Validation; testing; destructuring; composable "data macros" via conformers; we've tried spec in all its multifaceted glory. This talk is a distillation of lessons learned interspersed with musing on how spec alters development flow and one's thinking.
Presented at EuroClojure 2016
Having programmers do data science is terrible, if only everyone else were not even worse! The problem is tools – either a bunch of libraries and an agnostic IDE, or some point-and-click wonder which no matter how glossy never quite fits our needs. The dual lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way. This talk is a meditation on how I do data science with Clojure, what the ideal process would look like, and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
Clojure is fantastic for data manipulation and rapid prototyping, but falls short when it comes to communicating your insights. What is lacking are good visualization libraries and (sharable) notebook-like environments. I'll show my workflow which weaves Clojure with R (for ggplot) and Python (for scikit-learn) and tell you why it's wrong; how IPythons of the world have trapped us in a local maximum and why we need a reconceptualization similar to what a REPL does to programming. All this interposed with my experience doing data science with Clojure (everything from ETL to on-the-spot analysis during brainstorming sessions) and how these are interwoven into the design of Huri, my library for the lazy data scientist.
3. The Problem
… but eventually
• Want granularity smaller than GA exposes
• Want analysis GA doesn’t support
• Want to combine and analyse data from different sources
4. Goal: answer 80% of questions stemming from data in 20 min or less
5. The analytics chasm
[slide diagram: a spectrum from "2 min" (ideal, almost real-time, can be done during brainstorming without disrupting the flow) through "20 min" (squeeze in somewhere in the day) to "project" (added to roadmap), with the gap between them marked "fail"]
6. Levelling up
1.Acquire data (directly, or from 3rd party APIs)
2.Store it in a data warehouse
3.Transform it to a usable and unified shape
4.Perform analytics on it
7. Intermezzo: My perspective
• Core developer at Metabase, an open source BI/analytics tool. 3rd largest BI tool in the world. 20k+ companies use us daily, including N26, Revolut, Swisscom
• Built analytics department at GoOpti from the ground up
• Helped 20+ companies become data-driven
8. Levelling up
1.Acquire data (directly, or from 3rd party APIs)
2.Store it in a data warehouse
3.Transform it to a usable and unified shape
4.Perform analytics on it
9. Collecting requirements
1.Make a list of all the data sources you currently have, how much data is in them (number of entities), and at what rate the data grows
2.Collect user stories from all potential users:
As a ______ I’d like to _________, because _________
3.Match each user story with needed data sources
4.Rank user stories using PIE (probability, impact, effort)
5.Rank data sources by summing the PIE scores of all user stories that require them (see the sketch after this list)
6.Build data infrastructure to enable the high-value cluster
7.Continue doing steps 1-6 as you iterate
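To make the ranking in steps 4-6 concrete, here is a minimal sketch of how it could be computed. The scoring formula (probability times impact, discounted by effort), the example stories and the data source names are illustrative assumptions, not part of the original deck.

```python
# Hypothetical sketch of the PIE ranking from steps 4-6.
# The scoring formula, stories and source names are made up for illustration.
from collections import defaultdict

# Each user story: probability (0-1), impact (1-10), effort (1-10; lower is better)
stories = [
    {"story": "As a PM I'd like to see activation by cohort",
     "p": 0.8, "i": 8, "e": 3, "sources": ["product_db", "event_stream"]},
    {"story": "As a marketer I'd like to tie ad spend to revenue",
     "p": 0.6, "i": 9, "e": 5, "sources": ["ad_apis", "product_db"]},
]

def pie(story):
    # One possible reading of PIE: probability * impact, discounted by effort
    return story["p"] * story["i"] / story["e"]

# Step 5: rank data sources by summing the PIE scores of the stories that need them
source_scores = defaultdict(float)
for s in stories:
    for src in s["sources"]:
        source_scores[src] += pie(s)

for src, score in sorted(source_scores.items(), key=lambda kv: -kv[1]):
    print(f"{src}: {score:.2f}")
```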
10. A minimal data-collection plan
• Event stream
• Goal: be able to reconstruct any given session from data
• Timestamp, session, action, payload, context/result (see the sketch below)
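A minimal sketch of what one such event record could look like, assuming only the five fields listed above; the Event class and the example values are invented for illustration and are not Snowplow's actual schema.

```python
# Illustrative event record with the slide's five fields; not an actual Snowplow schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    timestamp: datetime                 # when it happened
    session: str                        # lets you reconstruct a whole session later
    action: str                         # what the user or device did
    payload: dict[str, Any]             # action-specific data
    context: dict[str, Any] = field(default_factory=dict)  # surrounding state / result

e = Event(
    timestamp=datetime.now(timezone.utc),
    session="sess-42",
    action="measurement_taken",
    payload={"device_id": "dev-7", "value": 98.6},
    context={"firmware": "1.4.2", "result": "ok"},
)
```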
12. Extract-Load-Transform
• Dump data somewhere as soon as possible so you don't lose it.
• Databases are fast and powerful enough to do most transforms there. In return you get:
• Observability
• Analysts become more self-sufficient (if they know SQL)
• For small-medium data sizes (< 1M data points/day) this is more performant and requires much less infrastructure (see the sketch below)
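As a rough illustration of the EL-then-T idea, the sketch below lands raw JSON untouched in a staging table and does the reshaping inside the database with SQL. The connection string, table and view names are assumptions made for the example.

```python
# Hypothetical EL(T) sketch: land raw JSON first, transform inside the database.
# Connection string, table and view names are made up for the example.
import json
import psycopg2

raw_events = [{"session": "sess-42", "action": "page_view", "payload": {"url": "/pricing"}}]

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # Load: dump the payload as-is so nothing is lost, even if it doesn't fit a schema yet
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_events (
            loaded_at timestamptz DEFAULT now(),
            body      jsonb
        )
    """)
    for e in raw_events:
        cur.execute("INSERT INTO raw_events (body) VALUES (%s)", (json.dumps(e),))

    # Transform: reshape in SQL, where the logic is observable and
    # SQL-literate analysts can follow and extend it themselves
    cur.execute("""
        CREATE OR REPLACE VIEW page_views AS
        SELECT loaded_at,
               body->>'session'          AS session,
               body->'payload'->>'url'   AS url
        FROM raw_events
        WHERE body->>'action' = 'page_view'
    """)
```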
13. Good ELT is:
• Repeatable
• Observable
• Extensible
• Scalable
• Recoverable (don't lose data, ever!)
15. Identify the principal axes of your data
• User, account, transaction, instance, product, event (log)…
• There will (and should) be some overlap
• Different axes will have different granularity
• Some should be ordered in time
16. Data warehouse topology
• Big fat denormalised tables, one for each principal axis
• Use views to tailor the representation to your tools and analysis needs (see the sketch below)
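A sketch of what this could look like for the user axis: one wide, denormalised table plus a view that tailors it to a specific question. The column names and the 30-day window are invented for illustration.

```python
# Hypothetical DDL for one "big fat" table per principal axis (here: user),
# plus a view that adapts it to a particular analysis. Columns are invented.
WIDE_USERS = """
CREATE TABLE IF NOT EXISTS users_wide (
    user_id        text PRIMARY KEY,
    signed_up_at   timestamptz,
    plan           text,
    country        text,
    -- facts pre-joined from other sources, denormalised on purpose
    lifetime_value numeric,
    n_sessions_30d int,
    last_seen_at   timestamptz
);
"""

# Views tailor the representation to a tool or question without copying data
ACTIVE_USERS_VIEW = """
CREATE OR REPLACE VIEW active_users AS
SELECT user_id, plan, country, n_sessions_30d
FROM users_wide
WHERE last_seen_at > now() - interval '30 days';
"""
```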
17. Which DB?
• Optimize for ease of ad-hoc querying
• Should be decently performant (waiting kills productivity) but is unlikely to be the bottleneck
• Simple to deploy, connect to, and use
• Strong data validation/schemas, but should also handle non-structured data (validation on load = data loss)
• Sane handling of timezones, date-time arithmetic, & numbers
18. My go-to stack
• Snowplow for event-like data
• Apache Airflow to manage the workflow (see the DAG sketch after this list)
• (managed) Postgres for data warehouse (or Druid if only event data and a lot of it)
• dbt for data transforms
• Metabase for analytics
• Fully open-source
• Extensible, performant
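A minimal sketch of how the pieces above might be wired together in Airflow, with an extract/load task feeding a dbt run; the DAG id, schedule, callable and dbt command are illustrative assumptions rather than the original setup.

```python
# Illustrative Airflow DAG wiring extract/load into a dbt transform.
# DAG id, schedule and commands are assumptions, not the original setup.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def mirror_product_db():
    # placeholder: copy tables from the product database into the warehouse
    ...

with DAG(
    dag_id="warehouse_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_load = PythonOperator(task_id="mirror_product_db",
                                  python_callable=mirror_product_db)
    transform = BashOperator(task_id="dbt_run",
                             bash_command="dbt run --project-dir /opt/dbt")
    extract_load >> transform  # Metabase then queries the transformed tables directly
```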
36. You can often encode dynamic processes as binary outcomes
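One way to read this slide: collapse a stream of user events (the dynamic process) into a single yes/no question such as "did the user convert within 7 days of signup?". The sketch below uses pandas and made-up data to illustrate the idea.

```python
# Collapsing a dynamic process (an event stream) into a binary outcome:
# "did the user convert within 7 days of signing up?" Data is synthetic.
import pandas as pd

events = pd.DataFrame({
    "user_id":   ["a", "a", "b", "b", "b"],
    "action":    ["signup", "purchase", "signup", "page_view", "purchase"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-05", "2024-01-20"]),
})

signup = events[events.action == "signup"].set_index("user_id").timestamp
first_purchase = events[events.action == "purchase"].groupby("user_id").timestamp.min()

converted_in_7d = (first_purchase - signup) <= pd.Timedelta(days=7)
print(converted_in_7d)  # a: True, b: False; a binary outcome you can count, segment and test
```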
37. Signal or noise?
• Trend & relative change often tell more than absolute values
• Percentiles (see the sketch after this list)
• Intra- vs. inter-segment variance
• Significance tests
• Sample representativeness (is not just for A/B tests)
• Distribution similarity
• Have a reference point (and reference it often)
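Two of the checks above in a few lines: look at percentiles rather than a single average, and run a significance test before trusting a difference between segments. The data is synthetic, and the Mann-Whitney U test is just one reasonable non-parametric choice, not the deck's prescribed method.

```python
# Percentiles instead of a lone mean, and a significance test between two segments.
# Synthetic data; Mann-Whitney U is one reasonable non-parametric choice here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
segment_a = rng.lognormal(mean=3.0, sigma=0.8, size=500)  # e.g. order values
segment_b = rng.lognormal(mean=3.1, sigma=0.8, size=60)

print(np.percentile(segment_a, [25, 50, 75, 95]))  # see the distribution, not just the mean
print(np.percentile(segment_b, [25, 50, 75, 95]))

stat, p_value = stats.mannwhitneyu(segment_a, segment_b)
print(f"Mann-Whitney U p-value: {p_value:.3f}")   # is the difference more than noise?
```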
39. MESI
• Medical devices
• North-star metric: number of measurements/device
• Current data sources: GA, product database, Countly, Sentry, HubSpot, Odoo
40. MESI data acquisition
• Collect event stream from devices capturing all the interactions [Snowplow]
• Mirror product database into data warehouse [Airflow]
• Collect event stream from the website [Snowplow]
• Integrate HubSpot and Odoo via API [Airflow]
• Integrate Sentry via API [Airflow]
• (Retire Countly)
• (Add support data — Jira, Zendesk, …)
• (Add accounting/billing)
41. MESI data warehouse
• (managed) Postgres
• Principal axes: account, user, device event, user journey event, device
42. MESI analytics
• Metabase
• User journey before conversion
• Device usage patterns
• UX friction points
• Onboarding
• Errors & support issues
• Segmentation
44. SalesGenomics
• eCommerce marketing agency focused on scale-up
• Typical customer marketing budget 10k-100k/month
• Current data sources: GA, FB, Shopify
• 2-sided reporting: for clients, internal
45. SalesGenomics data acquisition
• Custom event collector on websites (replacing GA snippet) [Snowplow]
• Integrate Shopify, AdWords, FB ads [Airflow]
— OR —
• Use Segment/Stitch Data
49. Starting from 0
• Set up GA (remember the minimal data-collection plan)
• Connect Metabase to your product DB
• Collect data user stories from day 1
• Focus analytics on user journey, segmentation, costs, & UX