I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD 2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human-in-the-Loop Data Analytics!
2. Many, many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
3. Scale is a Solved Problem
Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.
My claim: Scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB;
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on large datasets!
Of course, exceptions exist: the "1%" of data analysis needs
4. What about the Needs of the 99%?
The bottleneck is actually the "humans-in-the-loop".
As our data size has grown, what has stayed constant is:
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract value from even moderately sized datasets.
From "Big Data and its Technical Challenges", CACM 2014:
"For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly 'absorb' the results of the analysis and not get lost in a sea of data."
5. Need of the Hour: Human-in-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity
[Venn diagram of three areas, with "magic happens" at their intersection:]
• Interaction: taking the human perspective into account
• Data Mining: going beyond SQL
• Databases: scalability/interactivity is still important
6. A Maslow’s Hierarchy for HILDA
Background: Maslow developed a theory of what motivates individuals in 1943; it has been highly influential.
[Pyramid: Basic Needs at the base, Complex Needs at the top]
7. A Maslow’s Hierarchy for HILDA
[Pyramid, with sophistication of analysis increasing toward the top:]
• Touch & Feel (base)
• Play & View
• Share & Collaborate (top)
8. Touch and Feel: DataSpread
DataSpread is a spreadsheet-database hybrid.
Goal: marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases.
Enables the "99%" with large datasets but limited programming skills to open, touch, and examine their datasets.
http://dataspread.github.io
[VLDB'15, VLDB'15, ICDE'16]
9. Play and View: Zenvisage
Zenvisage is an effortless visual exploration tool.
Goal: "fast-forward" to visual patterns and trends without having the analyst step through each one individually.
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR'16, VLDB'16, VLDB'15, DSIA'15, VLDB'14, VLDB'14]
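To make the "fast-forward" idea concrete, here is a minimal sketch (not zenvisage's actual algorithm or API; all names are made up) of ranking candidate line charts by closeness to a pattern the analyst sketches, so the best matches surface first instead of being found by eyeballing every chart:

```python
# Illustrative sketch: rank line-chart candidates by Euclidean distance
# to a user-drawn query pattern. Each candidate is one chart's y-values
# over a shared x-axis (e.g., sales over time, one series per product).

def rank_by_similarity(query, candidates):
    """Return (name, values) pairs ordered by distance to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sorted(candidates, key=lambda nc: dist(query, nc[1]))

# The analyst sketches a rising trend; we surface the series that match.
query = [1, 2, 3, 4, 5]
candidates = [
    ("product_a", [5, 4, 3, 2, 1]),   # falling: poor match
    ("product_b", [1, 2, 3, 4, 6]),   # rising: close match
    ("product_c", [3, 3, 3, 3, 3]),   # flat
]
ranked = rank_by_similarity(query, candidates)
print([name for name, _ in ranked])  # best match first
```

The real system must also prune the enormous space of candidate visualizations efficiently; this sketch only shows the ranking step.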
10. Collaborate and Share: OrpheusDB
OrpheusDB is a tool for managing dataset versions with a database.
Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis.
Enables individuals to collaborate on data analysis, and to share, keep track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB'16, VLDB'15, VLDB'15, TaPP'15, CIDR'15]
(also part of DataHub: a collaborative analysis system with MIT & UMD)
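A toy sketch of the core storage idea behind dataset versioning (this is a hypothetical API for illustration, not OrpheusDB's actual interface): store each record once, and represent a version as the set of record ids it contains, so commits share unchanged records instead of copying the whole dataset.

```python
# Minimal versioned-dataset sketch: versions share record storage.

class VersionedDataset:
    def __init__(self):
        self.records = {}                      # record id -> record
        self.versions = {0: (None, frozenset())}  # vid -> (parent, record ids)
        self._next_rid = 0
        self._next_vid = 1

    def commit(self, parent, added=(), removed=()):
        """Create a new version derived from `parent`."""
        rids = set(self.versions[parent][1])
        for rec in added:
            rid = self._next_rid
            self._next_rid += 1
            self.records[rid] = rec            # stored exactly once
            rids.add(rid)
        rids -= set(removed)
        vid = self._next_vid
        self._next_vid += 1
        self.versions[vid] = (parent, frozenset(rids))
        return vid

    def checkout(self, vid):
        """Materialize a version as a plain list of records."""
        return [self.records[r] for r in sorted(self.versions[vid][1])]

ds = VersionedDataset()
v1 = ds.commit(0, added=[{"id": 1, "x": 10}, {"id": 2, "x": 20}])
v2 = ds.commit(v1, added=[{"id": 3, "x": 30}])   # v1 remains intact
print(len(ds.checkout(v1)), len(ds.checkout(v2)))  # 2 3
```

The hard research problems (compactly encoding which versions contain which records, and querying across versions) are exactly what such a sketch glosses over.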
11. This Talk
About 10 minutes per system: overview + architecture + one key technical challenge.
Common theme: if you torture databases enough, you can get them to do what you want!
[Pyramid again: Touch & Feel, Play & View, Share & Collaborate, with sophistication of analysis increasing toward the top]
13. Motivation
Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel.
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
14. But Spreadsheets areTerrible!
– Slow
• a single change can take minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, no collab/recovery
• using spreadsheets for collaboration is painful and error-prone
15. Let’s turn to Databases
Databases are:
• Slow → Scalable
• Tedious + not Powerful → Powerful and expressive (SQL)
• Brittle → Collaboration, recovery, succinct
So why not use databases?
Well, for the same reason why spreadsheets are so useful:
• Easy to use → Not easy to use
• Built-in visualization → No built-in visualization
• Flexible → Not flexible
16. Combining the benefits of
spreadsheets and databases
Spreadsheet as a frontend interface
Databases as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
17. Different Ideologies
Databases and spreadsheets have different
ideologies that need to be reconciled…
Due to this, the integration is not trivial…
Feature: Databases vs. Spreadsheets
• Data Model: schema-first vs. dynamic/no schema
• Addressing: tuples, with PK vs. cells, using row/col
• Presentation: set-oriented, no such notion vs. notion of current window, order
• Modifications: must correspond to queries vs. can be done at any granularity
• Computation: query at a time vs. value at a time
18. First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables
(Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples
(Row #, Column #, Value)
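The two extremes can be sketched in a few lines of Python (an illustrative model only, not DataSpread's actual storage schema; `as_table` and `as_triples` are hypothetical names):

```python
# Illustrative sketch of the two storage extremes: a dense region as
# row-oriented tuples, a sparse region as (row, col, value) triples.

def as_table(cells, n_rows, n_cols):
    """Dense layout: one tuple per row -> (row#, col1 val, col2 val, ...)."""
    return [
        (r,) + tuple(cells.get((r, c)) for c in range(n_cols))
        for r in range(n_rows)
    ]

def as_triples(cells):
    """Sparse layout: one (row#, col#, value) triple per non-empty cell."""
    return [(r, c, v) for (r, c), v in sorted(cells.items())]

# A 3x3 region with only 2 non-empty cells:
sparse = {(0, 0): "a", (2, 2): "b"}
print(as_table(sparse, 3, 3))   # 3 row tuples, mostly empty (NULL-heavy)
print(as_triples(sparse))       # just 2 triples
```

For mostly-full regions the table layout wins; for mostly-empty regions the triples avoid storing NULLs, which motivates carving a sheet into a mix of both.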
19. First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two
extremes? Yes!
Carve out
dense areas to store as tables,
sparse areas to store as triples
20. First Problem: Representation
However, even if we only use “tables”, carving out
the ideal set of partitions (minimizing storage, modification,
and access costs) is NP-Hard
Reduction from min. edge-length partition of
rectilinear polygons
Thankfully, we have a way out…
21. Solution: Constrain the Problem
A new class of partitionings: recursive decomp.
A very natural class of partitionings!
22. Solution: Constrain the Problem
The optimal recursive
decomp. partitioning can be
found in PTIME using DP
Still quadratic in # rows,
columns
Merge rows/columns with
identical signatures
~ the time for a single scan
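A toy version of this DP can be sketched as follows, under an assumed cost model: a fixed overhead `PART` per stored partition, table cost = rectangle area, triple cost = 3 per non-empty cell. None of these constants come from the paper, and this is the plain quadratic DP without the signature-merging speedup:

```python
from functools import lru_cache

PART = 4  # assumed fixed overhead per stored partition (illustrative)

def min_cost(filled, n_rows, n_cols):
    # 2-D prefix sums so we can count non-empty cells in any rectangle.
    pref = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for r in range(n_rows):
        for c in range(n_cols):
            pref[r + 1][c + 1] = (pref[r][c + 1] + pref[r + 1][c]
                                  - pref[r][c] + ((r, c) in filled))

    def count(r1, r2, c1, c2):  # non-empty cells in [r1,r2) x [c1,c2)
        return pref[r2][c2] - pref[r1][c2] - pref[r2][c1] + pref[r1][c1]

    @lru_cache(maxsize=None)
    def solve(r1, r2, c1, c2):
        k = count(r1, r2, c1, c2)
        if k == 0:
            return 0                                  # nothing to store
        best = PART + min((r2 - r1) * (c2 - c1),      # store whole: dense table
                          3 * k)                      # or (row, col, value) triples
        for r in range(r1 + 1, r2):                   # or one horizontal cut, recurse
            best = min(best, solve(r1, r, c1, c2) + solve(r, r2, c1, c2))
        for c in range(c1 + 1, c2):                   # or one vertical cut, recurse
            best = min(best, solve(r1, r2, c1, c) + solve(r1, r2, c, c2))
        return best

    return solve(0, n_rows, 0, n_cols)

# 4x4 sheet: a dense 2x2 block plus one isolated cell.
filled = {(0, 0), (0, 1), (1, 0), (1, 1), (3, 3)}
print(min_cost(filled, 4, 4))  # 8 (2x2 table) + 5 (1x1 table) = 13
```

Every recursive decomposition is either a single stored rectangle or one guillotine cut with the two halves decomposed recursively, which is exactly what the memoized `solve` enumerates.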
23. Initial Progress and Architecture
Postgres backend
ZK spreadsheet
• open-source web
frontend
Comfortably scales to
arbitrarily many rows
+ handles SQL queries
Hopefully bring
spreadsheets to the big
data age!
[Architecture diagram: users (Sally, Bob, Sue) and other applications connect via the spreadsheet frontend; spreadsheet formulae, vanilla SQL, and a new interface algebra express interface-embedded queries, evaluated by an interface query processor, interface transaction manager, and interface storage manager with interface-aware indexes over the underlying data]
25. Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual
pattern or insight
4. Repeat until you find a match
27. Key Issue:
Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find
desired visual patterns!
28. Motivation
This is a real problem!
• Advertisers at Turn
– find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center
– find aspects on which two sets of genes differ
• Battery scientists at CMU
– find solvents with desired properties
Common theme: finding the “right” visualization can take
several hours of combing through visualizations manually.
29. Key Insight
We can automate that!
• instead of combing through visualizations manually
• tell us what you want, and we can “fast-forward” to desired insights
Desiderata for automation:
• Expressive – the ability to specify what you want
• Interactive – interact with the results, catering to non-programmers
• Scalable – get interesting results quickly
Enter Zenvisage:
(zen + envisage: to effortlessly visualize)
30. Effortless Visual Exploration
of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch based interactions
• to find specific patterns
• Sophisticated visual exploration language, ZQL
• to ask more elaborate questions
• Scalable visualization generation engine
• preprocess, batch and parallel eval. for interactive results
• Rapid pattern matching algorithms
• sampling-based techniques
33. Challenges: One Specific Instance
Find visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d,m,f] on a specific subset of data can be computed using a
single SQL query.
34. Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]:
Compute visualization for both products (two SQL queries),
then compare
Pick k best (“highest utility”) [d, m, f]
Utility Metric: We ignore how to compare for now, but there are
many standard distance metrics
Scale: 10s of dimensions, 10s of measures, handful of
aggregates → 100s of queries for a single user task!
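The naïve approach can be made concrete with a small sqlite3 sketch. The `sales` schema and data are hypothetical, and plain Euclidean distance stands in for the utility metric (just one of the many standard distance metrics mentioned above):

```python
import sqlite3
from itertools import product as cross

# Toy instance of the naive approach: for every [d, m, f], issue one
# GROUP BY query per product, then rank by distance between the two
# resulting visualizations. (Illustrative schema, not Zenvisage's engine.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales(prod TEXT, region TEXT, month TEXT,"
             " qty REAL, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?)", [
    ("x", "east", "jan", 10, 2.0), ("x", "west", "jan", 12, 2.1),
    ("x", "east", "feb", 11, 2.0), ("y", "east", "jan", 30, 2.0),
    ("y", "west", "jan", 2, 2.2), ("y", "east", "feb", 9, 1.9),
])

def viz(prod, d, m, f):
    """One [d, m, f] visualization for one product = one SQL query."""
    rows = conn.execute(
        f"SELECT {d}, {f}({m}) FROM sales WHERE prod=? GROUP BY {d}", (prod,))
    return dict(rows.fetchall())

def utility(va, vb):
    """Euclidean distance over the union of x-axis values."""
    keys = set(va) | set(vb)
    return sum((va.get(k, 0) - vb.get(k, 0)) ** 2 for k in keys) ** 0.5

dims, measures, funcs = ["region", "month"], ["qty", "price"], ["AVG", "SUM"]
ranked = sorted(
    ((utility(viz("x", d, m, f), viz("y", d, m, f)), (d, m, f))
     for d, m, f in cross(dims, measures, funcs)),
    reverse=True)
top_k = [dmf for _, dmf in ranked[:2]]   # k best ("highest utility") triples
print(top_k)
```

Even this toy instance issues 2 × |dims| × |measures| × |funcs| = 16 queries for one task, which is exactly why sharing work across queries and pruning low-utility candidates matter at real scale.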
35. Issues w/ Naïve Approach
• Repeated processing of same
data in sequence across queries → Sharing
• Computation wasted on
low-utility visualizations → Pruning
42. Motivation
Collaborative data science is
ubiquitous
• Many users, many versions of the
same dataset stored at many
stages of analysis
• Status quo:
– Stored in a file system, relationships
unknown
Challenge: can we build a versioned
data store?
– Support efficient access, retrieval,
querying, and modification of
versions
43. Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: Relational databases don’t support
versioning, but are efficient and scalable
45. Challenge: Storing Versions
Compactly/Retrieving Versions Quickly
1000s of versions, spanning millions of records.
Store all versions independently
Huge storage, version access time is very small
Store one version, all others via chains of “deltas”
Very small storage, version access time is high
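The delta-chain trade-off can be sketched with a simplified model where a version is just a set of tuples (illustrative names and delta format, not OrpheusDB's actual representation):

```python
# Store one base version; every other version is a delta (tuples added
# and removed relative to its parent). Retrieving version i means
# replaying the chain -- small storage, higher access time.

def make_delta(parent, child):
    return {"added": child - parent, "removed": parent - child}

def checkout(base, deltas, i):
    """Materialize version i by replaying deltas 0..i-1 onto the base."""
    v = set(base)
    for d in deltas[:i]:
        v = (v - d["removed"]) | d["added"]
    return v

v0 = {("alice", 100), ("bob", 90)}
v1 = (v0 - {("bob", 90)}) | {("bob", 95), ("carol", 80)}
v2 = v1 | {("dave", 70)}

deltas = [make_delta(v0, v1), make_delta(v1, v2)]
assert checkout(v0, deltas, 2) == v2   # chain replay reconstructs v2
```

Storing every version independently is the opposite extreme: `checkout` becomes a lookup, but each version repeats all the tuples it shares with its parent.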
46. And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is
greater than 1000
• Find all pairs of versions where over 100 new
tuples were added
• Show the history of the tuple with record id 34.
For more examples, see [TAPP’15]
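Assuming a hypothetical flat versioning schema `data(vid, rid, salary)` with one row per (version, record) — illustrative only, not OrpheusDB's actual layout — a few of these queries look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data(vid INT, rid INT, salary REAL)")
conn.executemany("INSERT INTO data VALUES (?,?,?)", [
    (1, 34, 900), (1, 35, 800),
    (2, 34, 1200), (2, 35, 1100), (2, 36, 1300),
])

# Versions where average(salary) is greater than 1000:
rich = [v for (v,) in conn.execute(
    "SELECT vid FROM data GROUP BY vid HAVING AVG(salary) > 1000")]

# First version that contains the tuple with record id 36:
first = conn.execute("SELECT MIN(vid) FROM data WHERE rid=36").fetchone()[0]

# History of the tuple with record id 34:
hist = conn.execute(
    "SELECT vid, salary FROM data WHERE rid=34 ORDER BY vid").fetchall()

print(rich, first, hist)
```

The point of a versioned data store is that such multi-version queries become declarative one-liners instead of scripts that diff files by hand.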
48. Summary:
Make Data Analytics Great Again!
orpheus-db.github.io
Share &
Collaborate
Play &
View
Touch &
Feel
Increasing sophistication of analysis
zenvisage.github.io
dataspread.github.io
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp