An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12) http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases.
http://www.eventbrite.com/event/7476758185
Data Center Computing for Data Science: an evolution of machines, middleware,... - Paco Nathan
Guest lecture 2013-08-27 at General Assembly in SF for the Data Science program taught by Jacob Bollinger and Thomson Nguyen https://generalassemb.ly/education/data-science/san-francisco
Many thanks to Thomson, Jacob, and the participants in the course. Excellent Q&A!
Received a bottle o' Cardhu (my fave Scotch) in payment for lecture, and since it's Burning Man Week, the city was emptied so we had enough to share with the class :)
Evidence:
https://plus.google.com/u/0/110794698656267747127/posts/GvjhhQ99CTs
The Internet of Things, or IoT, is a vision for a ubiquitous society wherein people and “Things” are connected in an immersively networked computing environment, with the connected “Things” providing utility to people/enterprises and their digital shadows through intelligent social and commercial services. However, translating this idea into a conceivable reality has been a work in progress for close to two decades, mostly because assumptions have favoured a “Things”-centric rather than a “Human”-centric approach, coupled with the evolution/deployment ecosystem of IoT technologies.
Estimates on the spread and economic impact of IoT over the next few years are in the neighborhood of 50 billion or more connected “Things” with a market exceeding $350 billion through smarter cities and infrastructure, intelligent appliances, and healthier lifestyles. While many of these potential benefits from IoT are real and achievable, the road to accomplishing them may need a rethink.
In the last few years, there has been a realization that an effective architecture for IoT (particularly for emerging nations with limited technology penetration at the national scale) that is both affordable and sustainable should be based on tangible technology advances in the present, ubiquitous capabilities of the present/future, and practical application scenarios of social and entrepreneurial value. Hence, there is renewed interest in rethinking the above assumptions, and this exercise has led to a more plausible set of scenarios wherein humans, along with data, communication and devices, play key roles.
In this presentation, an attempt is made to disaggregate these core problems and offer a trajectory with a set of design paradigms for a renewed IoT ecosystem.
Python's Role in the Future of Data Analysis - Peter Wang
Why is "big data" a challenge, and what roles do high-level languages like Python have to play in this space?
The video of this talk is at: https://vimeo.com/79826022
A look back at how the practice of data science has evolved over the years, modern trends, and where it might be headed in the future. Starting from before anyone had the title "data scientist" on their resume, to the dawn of the cloud and big data, and the new tools and companies trying to push the state of the art forward. Finally, some wild speculation on where data science might be headed.
Presentation given to Seattle Data Science Meetup on Friday July 24th 2015.
Big data is everywhere, although sometimes we may not immediately realize it. The first thing to recognize is that most of us don't deal with large amounts of data in our lives except in unusual circumstances. Lacking this immediate experience, we often fail to understand both the opportunities and the challenges presented by big data. There are currently a number of issues and challenges in addressing these characteristics going forward.
Do you have doubts or questions about how to use Big Data in your organization? This presentation should clear up some of them.
Feel free to comment if you have more queries or write to us at: bigdata@xoriant.com
Smart Data Slides: Data Science and Business Analysis - A Look at Best Practi... - DATAVERSITY
Google “citizen data scientist” today and you will see about 1M results. That number is data. It may be interesting, but it is meaningless without context. Sometimes it appears that we are drowning in data from systems and sensors but starving for insights. We definitely produce more of the former than the latter, which has created demand for more powerful tools to simplify the process and lower the skills requirement for analysis. As vendors build systems to meet this demand, we hear about the coming “democratization” of big data as more people at varying levels within organizations are empowered to find meaning and improve their own performance with data-driven insights. This is a good thing, but it does require caution.
To paraphrase Col Jessup in A Few Good Men: You want answers? You can’t handle the data.
In this webinar, we will survey emerging approaches to simplifying analysis, and discuss the benefits, dangers, and skills required for individuals and organizations to thrive in the brave new world of analytics everywhere, for everyone.
In this presentation, Wes Eldridge will provide a general overview of data science. The talk will cover a variety of topics. Wes will start with the dirty history of the field, which will help add context. After covering the history of data and data science, Wes will discuss the common roles a data scientist holds in businesses and organizations. Next, he will talk about how to use data in your organization and products. Finally, he'll cover some tools to help you get started in data science. After the presentation, Wes will stick around for Q&A and data discussion.
Big data Mining Using Very-Large-Scale Data Processing Platforms - IJERA Editor
Big Data consists of large-volume, complex, growing data sets with multiple, heterogeneous sources. With the
tremendous development of networking, data storage, and data collection capacity, Big Data is now rapidly
expanding in all science and engineering domains, including physical, biological and biomedical sciences. The
MapReduce programming model has the parallel processing ability needed to analyze large-scale networks.
MapReduce is a programming model that allows easy development of scalable parallel applications to process
big data on large clusters of commodity machines. Google’s MapReduce or its open-source equivalent Hadoop
is a powerful tool for building such applications.
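As a rough illustration of the MapReduce model described above, the following single-process Python sketch counts words with explicit map, shuffle, and reduce steps. The function names and the sample documents are assumptions made for this example; a real framework such as Hadoop would distribute these phases across a cluster rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair per word in a single input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents, used only to exercise the sketch.
counts = word_count([
    "big data on large clusters",
    "big clusters of commodity machines",
])
```

Because each map call looks at one document and each reduce call looks at one key, both phases can in principle run in parallel on separate machines, which is the property the abstract refers to.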
The presentation is about the career path in the field of Data Science. Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy in forming Big Data Science. It highlights the benefits of investing in them and defines a path for their evolution within most organisations.
In this presentation, I talk briefly about Big Data and its importance. I cover the very basics of Data Science and its importance in the present day through a case study. You can also get an idea of who a data scientist is and what tasks a data scientist performs. A few applications of data science are illustrated at the end.
The profile of the management (data) scientist: Potential scenarios and skill... - Juan Mateos-Garcia
Big and Social Media data open up new scenarios and opportunities for management research (such as using internal communication data to map knowledge networks inside firms, or using web data to study firm capabilities and strategies). This presentation, given at the British Academy of Management 2014 conference, proposes a typology of such scenarios, describes the skills required to exploit them, and considers implications for the education and training of management researchers.
Machine Learning On Big Data: Opportunities And Challenges- Future Research D... - PhD Assistance
Machine Learning (ML) is increasingly used in a variety of applications. It has risen to prominence in recent years, owing in part to the emergence of big data. When it comes to big data, ML algorithms have never been more promising. Big data allows machine learning algorithms to discover finer-grained patterns and make more timely and precise predictions than ever before; however, it also poses significant challenges to machine learning, such as model scalability and distributed computing.
Learn More: https://bit.ly/2RB1buD
Contact Us:
Website: https://www.phdassistance.com/
UK NO: +44–1143520021
India No: +91–4448137070
WhatsApp No: +91 91769 66446
Email: info@phdassistance.com
Toward a System Building Agenda for Data Integration(and Dat.docx - juliennehar
Toward a System Building Agenda for Data Integration
(and Data Science)
AnHai Doan, Pradap Konda, Paul Suganthan G.C., Adel Ardalan, Jeffrey R. Ballard, Sanjib Das,
Yash Govind, Han Li, Philip Martinkus, Sidharth Mudgal, Erik Paulson, Haojun Zhang
University of Wisconsin-Madison
Abstract
We argue that the data integration (DI) community should devote far more effort to building systems,
in order to truly advance the field. We discuss the limitations of current DI systems, and point out that
there is already an existing popular DI “system” out there, which is PyData, the open-source ecosystem
of 138,000+ interoperable Python packages. We argue that rather than building isolated monolithic DI
systems, we should consider extending this PyData “system”, by developing more Python packages that
solve DI problems for the users of PyData. We discuss how extending PyData enables us to pursue an
integrated agenda of research, system development, education, and outreach in DI, which in turn can
position our community to become a key player in data science. Finally, we discuss ongoing work at
Wisconsin, which suggests that this agenda is highly promising and raises many interesting challenges.
1 Introduction
In this paper we focus on data integration (DI), broadly interpreted as covering all major data preparation steps
such as data extraction, exploration, profiling, cleaning, matching, and merging [10]. This topic is also known
as data wrangling, munging, curation, unification, fusion, preparation, and more. Over the past few decades, DI
has received much attention (e.g., [37, 29, 31, 20, 34, 33, 6, 17, 39, 22, 23, 5, 8, 36, 15, 35, 4, 25, 38, 26, 32, 19,
2, 12, 11, 16, 2, 3]). Today, as data science grows, DI is receiving even more attention. This is because many
data science applications must first perform DI to combine the raw data from multiple sources, before analysis
can be carried out to extract insights.
Yet despite all this attention, today we do not really know whether the field is making good progress. The
vast majority of DI works (with the exception of efforts such as Tamr and Trifacta [36, 15]) have focused on
developing algorithmic solutions. But we know very little about whether these (ever-more-complex) algorithms
are indeed useful in practice. The field has also built mostly isolated system prototypes, which are hard to use and
combine, and are often not powerful enough for real-world applications. This makes it difficult to decide what
to teach in DI classes. Teaching complex DI algorithms and asking students to do projects using our prototype
systems can train them well for doing DI research, but are not likely to train them well for solving real-world DI
problems in later jobs. Similarly, outreach to real users (e.g., domain scientists) is difficult. Given that we have
Copyright 0000 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for
advertising or promotional purpose ...
Higher education institutions nowadays operate in an increasingly complex and
competitive environment. Applying innovation is a must for sustaining their competitive advantage.
Institution leaders are using data management and analytics to question the status quo and develop effective
solutions. Achieving these insights and information requires not a single report from a single system, but
rather the ability to access, share, and explore institution-wide data that can be transformed into meaningful
insights at every level of the institution. Consequently, institutions are facing problems in providing necessary
information technology support for fulfilling excellence in performance. More specifically, the best practices
of big data management and analytics need to be considered within higher education institutions. Therefore,
the study aimed at investigating big data and analytics, in terms of: (1) definition; (2) its most important
principles; (3) models; and (4) benefits of its use to fulfill performance excellence in higher education
institutions. This involves shedding light on big data and analytics models and the possibility of its use in
higher education institutions, and exploring the effect of using big data and analytics in achieving performance
excellence. To reach these objectives, the researcher employed a qualitative research methodology for
collecting and analyzing data. The study's most important conclusion is that there is a significant
relationship between big data and analytics and excellence of performance, as big data management and
analytics mainly aim at achieving tasks quickly with the least effort and cost. These positive results support
the use of big data and analytics in institutions and improving knowledge in this field and providing a practical
guide adaptable to the institution structure. This paper also identifies the role of big data and analytics in
institutions of higher education worldwide and outlines the implementation challenges and opportunities in the
education industry.
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of... - Sahilakhurana
Banking and securities
Challenges
Early warning for securities fraud and trade visibilities
Card fraud detection and audit trails
Enterprise credit risk reporting
Customer data transformation and analytics.
The Securities and Exchange Commission (SEC) is using big data to monitor financial market activity by using network analytics and natural language processing. This helps to catch illegal trading activity in the financial markets.
The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. The lifecycle has six phases, and project work can occur in several phases at once. For most phases in the lifecycle, the movement can be either forward or backward. This iterative depiction of the lifecycle is intended to more closely portray a real project, in which aspects of the project move forward and may return to earlier stages as new information is uncovered and team members learn more about various stages of the project. This enables participants to move iteratively through the process and drive toward operationalizing the project work.
Phase 1—Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
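To make the ETL versus ELT distinction in Phase 2 concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the analytic sandbox. The table names, the sample rows, and the cleaning rules are all assumptions made for illustration; they are not part of the lifecycle description itself.

```python
import sqlite3

# Hypothetical raw extract: names with stray whitespace, amounts as text.
RAW_ROWS = [(" Alice ", "42"), ("bob", "17"), ("  Carol", "n/a")]


def clean(name, amount):
    """Transform step in Python: normalize the name, parse the amount or use None."""
    return name.strip().lower(), int(amount) if amount.isdigit() else None


def etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn.execute("CREATE TABLE etl_sales (name TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO etl_sales VALUES (?, ?)",
        [clean(name, amount) for name, amount in RAW_ROWS],
    )


def elt(conn):
    """ELT: load the raw rows first, then transform inside the sandbox with SQL."""
    conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE elt_sales AS
        SELECT LOWER(TRIM(name)) AS name,
               CASE WHEN amount = '' OR amount GLOB '*[^0-9]*' THEN NULL
                    ELSE CAST(amount AS INTEGER) END AS amount
        FROM raw_sales
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT name, amount FROM etl_sales ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM elt_sales ORDER BY name").fetchall()
```

Both paths end with the same conditioned table. The practical difference is that ELT keeps the raw data in the sandbox (`raw_sales` here), so the team can revisit the transformation as it learns more about the data.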
Data and Analytics Career Paths, Presented at IEEE LYC'19.
About Speaker:
Ahmed Amr is a Data/Analytics Engineer at Rubikal, where he leads, develops, and creates daily data/analytics operations, which include data ingestion, data streaming, data warehousing, and analytical dashboards. Ahmed graduated from the Computer Engineering Department, Alexandria University, and he is currently pursuing his MSc degree in Computer Science at AAST. Professionally, Ahmed has worked with Egyptian/US startups (such as Badr, Incorta, and WhoKnows) to develop their data/analytics projects. Academically, Ahmed worked as a Teaching Assistant in the CS department at AAST. Ahmed helps software companies develop robust data engineering infrastructure and powerful analytical insights.
References:
1) https://www.datacamp.com/community/tutorials/data-science-industry-infographic
2) Analytics: The real-world use of big data, IBM, Executive Report
At Ikeen Technologies, we combine expertise in various domains, including software development, web design, data analytics, artificial intelligence, and cloud computing, to offer comprehensive solutions that meet the unique needs of our clients. Our team of skilled professionals possesses deep industry knowledge and technical
Techeduxon is a cutting-edge technology company that specializes in developing innovative solutions and educational tools for the field of education. With a strong focus on integrating technology into learning environments, Techeduxon aims to enhance the educational experience for students and educators alike.
At Techeduxon, a team of passionate engineers, designers, and educators collaborate to create high-quality products that address the evolving needs of modern education. Their range of offerings includes software applications, interactive learning platforms, hardware devices, and curriculum resources.
Global Data Management: Governance, Security and Usefulness in a Hybrid World - Neil Raden
With Global Data Management methodology and tools, all of your data can be accessed and used no matter where it is or where it is from: on-premises, private cloud, public cloud(s), hybrid cloud, open source, third-party data and any combination of these, with security, privacy and governance applied as if they were a single entity. Ingenious software products and the economics of computing make it economical to do this. Not free, but feasible.
From eGov 2.0 to eGov 3.0: The Research Agenda - samossummit
The research agenda of the eGov area will be discussed in this session, focusing on innovative ideas and directions for its advancement from the eGov 2.0 to a new eGov 3.0 paradigm.
Yannis Charalabidis, University of the Aegean, Greece
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
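The routing idea behind this pattern can be sketched in a few lines of Python: confident predictions are accepted automatically, and the rest are referred to a human whose answers become new training examples. This is only a minimal sketch under stated assumptions. The confidence threshold, the function names, and the toy scores are hypothetical, and it is not the implementation presented in the talk.

```python
def route(class_probs, threshold=0.8):
    """Accept the top label when the model is confident, otherwise defer to a human."""
    label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return "auto", label
    return "human", None


def hitl_step(items, predict, ask_expert, threshold=0.8):
    """One pass of the loop: label every item and collect the expert's decisions."""
    labels, new_examples = {}, []
    for item in items:
        decision, label = route(predict(item), threshold)
        if decision == "human":
            label = ask_expert(item)            # the exception goes to a person
            new_examples.append((item, label))  # fed into the next model iteration
        labels[item] = label
    return labels, new_examples


# Hypothetical model scores and expert answers, used only to exercise the sketch.
TOY_SCORES = {
    "python tutorial": {"programming": 0.95, "cooking": 0.05},
    "java island travel": {"programming": 0.55, "travel": 0.45},
}
labels, new_examples = hitl_step(
    list(TOY_SCORES),
    predict=TOY_SCORES.get,
    ask_expert=lambda item: "travel",
)
```

In a real pipeline the collected `new_examples` would be used to retrain the model before the next pass, so that fewer items need human review over time.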
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry (Paco Nathan)
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `spaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
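To make the graph idea concrete, here is a minimal sketch of TextRank-style keyword ranking in plain Python. It is not PyTextRank's API or implementation: it assumes a tiny hard-coded stopword list instead of part-of-speech filtering, ranks single words rather than assembling multi-word keyphrases, and the function name `textrank_keywords` is hypothetical.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption; real pipelines filter by part of speech).
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is", "are", "from", "which", "with"}
)


def textrank_keywords(text, window=2, damping=0.85, iterations=50):
    """Rank words by running PageRank-style updates on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected edge between words that appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate S(v) = (1 - d) + d * sum(S(u) / degree(u)) over the neighbors u of v.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = textrank_keywords(
    "graph algorithms rank graph nodes; graph edges link graph nodes"
)
```

In this toy sentence the word "graph" co-occurs with every other word, so it ends up with the highest score.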
Use of standards and related issues in predictive analytics (Paco Nathan)
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth called "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large a fundamental change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that end we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data (Paco Nathan)
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning (Paco Nathan)
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming (Paco Nathan)
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
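One way to picture the memory-versus-accuracy trade-off mentioned above is a Count-Min sketch, which keeps approximate counts in a fixed-size table. The sketch below is only an illustration in plain Python: the class name, the table dimensions, and the SHA-256-based hashing are assumptions for this example, and it is not the specific structure or the error bound discussed in the talk.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed memory; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width  # columns per row: more columns, fewer collisions
        self.depth = depth  # independent hash rows: more rows, tighter estimates
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # Derive one column per row by salting the hash with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for _ in range(100):
    sketch.add("spark")
sketch.add("storm", 5)
```

Memory stays at `width * depth` counters no matter how many distinct items stream through; the price is that `estimate` may overcount when items collide.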
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (Paco Nathan)
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
DSSG Speaker Series: Paco Nathan
1. DSSG Speaker Series, 2013-08-12:
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases
Paco Nathan @pacoid
Chief Scientist, Mesosphere
2. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
DSSG, 2013-08-12
3. employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
[Diagram labels: Process, Variation, Data, Tools]
Statistical Thinking
4. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced the prior practices
of data modeling
because the data won’t fit on one computer anymore
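The four data-modeling steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tail_probability` is hypothetical, the "known distribution" is taken to be a normal distribution, and the inference is a simple tail probability.

```python
import random
from statistics import NormalDist


def tail_probability(data, sample_size, threshold, seed=0):
    """Estimate P(value > threshold) the classic data-modeling way."""
    rng = random.Random(seed)
    # 1. sample the data
    sample = rng.sample(data, sample_size)
    # 2. fit the sample to a known distribution (assumed normal here)
    fitted = NormalDist.from_samples(sample)
    # 3. ignore the rest of the data: only `sample` is used from here on
    # 4. infer, based on that fitted distribution
    return 1.0 - fitted.cdf(threshold)


data = list(range(1000))
p_far_above = tail_probability(data, sample_size=100, threshold=2000)
p_far_below = tail_probability(data, sample_size=100, threshold=-1000)
```

Everything outside the sample never influences the answer, which is what stops working well once the interesting signal lives in the data that was thrown away.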
5. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
6. approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks that can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to socialize the problems, knocking down silos
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making process repeatable
What is needed most?
[Figure: sample of machine-generated log event names, such as UniqueRegistration, Launchedgameslobby, NUI:TutorialMode, CreateNewPet, BuyanItem:web, CustomerMadePurchaseCartPageStep2, AddFriend, WebsiteLogin]
7. Team Process = Needs
• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: deliver data products at scale to LOB end uses
• apps: build smarts into product features
• systems: keep infrastructure running, cost-effective
analysts, engineers, inter-disciplinary leadership
8. Team Composition = Roles
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
• introduced capability: data science
leverage non-traditional pairing among roles, to complement skills and tear down silos
10. Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/
2012/08/SkillsSelfIDMosaic-edit-500px.png
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
11. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that anti-pattern leads to big teams, low ROI
[Figure axes: scale, complexity]
ultimately, the challenge is about
managing learning curves within
a social context
12. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
DSSG, 2013-08-12
13. Business Disruption through Data
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing the Chasm
@Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the
entire Global 1000 on notice over the next decade…
data as the major force… mostly through apps –
verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
@XLDB, 2012:
complex analytics workloads are now displacing SQL
as the basis for Enterprise apps
14. Data Categories
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
let’s now add other useful distinctions:
• Open Data
• Curated Metadata
• A/D conversion for sensors (IoT)
15. Open Data notes
successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
consider buildingeye.com, aggregate building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history,
for better insights about properties for sale…
research notes about
Open Data use cases:
goo.gl/cd995T
16. Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
Adapted from:
The Roles, Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments have run out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
17. Open Data ecosystem
• municipal departments, e.g., Palo Alto, Chicago, DC, etc.
• publishing platforms, e.g., Junar, Socrata, etc.
• aggregators, e.g., OpenStreetMap, WalkScore, etc.
• data product vendors, e.g., Factual, Marinexplore, etc.
• end use cases, e.g., Facebook, Climate, etc.
Data feeds structured for public private partnerships
18. Open Data ecosystem – caveats for agencies
Required Focus
• respond to viable use cases
• not budgeting hackathons
19. Open Data ecosystem – caveats for publishers
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
20. Open Data ecosystem – caveats for aggregators
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
21. Open Data ecosystem – caveats for data vendors
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream,
i.e., cleaned data at source
• focus on core verticals
22. Open Data ecosystem – caveats for end uses
Required Focus
• address consumer needs
• identify community benefits
of the data
23. algorithmic modeling
+ machine data (Big Data)
+ curation, metadata
+ Open Data
data products, as feedback into automation
evolution of feedback loops
less about “bigness”, more about complexity
internet of things
+ A/D conversion
+ more complex analytics
accelerated evolution, additional feedback loops
orders of magnitude higher data rates
Recipes for Success
source: National Geographic
“A kind of Cambrian explosion”
source: National Geographic
25. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
technologyreview.com/...
26. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
DSSG, 2013-08-12
27. Learning Theory
in general, apps alternate between learning patterns/rules
and retrieving similar things…
machine learning – scalable, arguably quite ad-hoc,
generally “black box” solutions, enabling you to make billion
dollar mistakes, with oh so much commercial emphasis
(i.e., the “heavy lifting”)
statistics – rigorous, much slower to evolve, confidence
and rationale become transparent, preventing you from
making billion dollar mistakes, any good commercial project
has ample stats work used in QA
(i.e., “CYA, cover your analysis”)
once Big Data projects get beyond merely digesting
log files, optimization will likely become the next
overused buzzword :)
28. Generalizations about Machine Learning…
great introduction to ML, plus a proposed categorization
for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
toward a categorization for Machine Learning algorithms:
• representation: classifier must be represented in some
formal language that computers can handle (algorithms, data
structures, etc.)
• evaluation: evaluation function (objective function, scoring
function) is needed to distinguish good classifiers from bad
ones
• optimization: method to search among the classifiers in
the language for the highest-scoring one
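As a rough illustration of the three components above, the sketch below uses a deliberately tiny setup: one-dimensional threshold rules as the representation, training accuracy as the evaluation function, and exhaustive search over candidate thresholds as the optimization. The toy data and the names (`predict`, `accuracy`, `fit_threshold`) are made up for this example and are not taken from the Domingos paper.

```python
from typing import List, Tuple

Example = Tuple[float, int]  # (feature value, label in {0, 1})


def predict(threshold: float, x: float) -> int:
    """Representation: a classifier is a single threshold on one feature."""
    return 1 if x >= threshold else 0


def accuracy(threshold: float, data: List[Example]) -> float:
    """Evaluation: fraction of examples the classifier labels correctly."""
    correct = sum(1 for x, y in data if predict(threshold, x) == y)
    return correct / len(data)


def fit_threshold(data: List[Example]) -> float:
    """Optimization: exhaustive search over thresholds taken from the data."""
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: accuracy(t, data))


# hypothetical toy data: small values are class 0, large values are class 1
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
best = fit_threshold(data)
print(best, accuracy(best, data))
```

Swapping any one component (say, a different scoring function) leaves the other two untouched, which is the point of the categorization.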
29. Something to consider about Algorithms…
many algorithm libraries used today are based on implementations
from back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead of e-commerce in
terms of data rates and sophisticated algorithms work; as Breiman
suggested in 2001, that work may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
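To make the last bullet concrete, here is a minimal sketch of partial aggregates, assuming the statistic of interest is a mean: each partition produces a small (sum, count) summary independently, and the summaries are merged afterwards. The class name `MeanAgg` and the sample partitions are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class MeanAgg:
    """Partial aggregate for a mean: keep the running sum and count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "MeanAgg") -> "MeanAgg":
        # merging is associative and commutative, so partitions can be
        # combined in any order, e.g. in a tree of reducers
        return MeanAgg(self.total + other.total, self.count + other.count)

    @property
    def mean(self) -> float:
        return self.total / self.count


def partial(values: Iterable[float]) -> MeanAgg:
    """Summarize one partition without seeing the others."""
    vals = list(values)
    return MeanAgg(sum(vals), len(vals))


# hypothetical data split across three workers
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [partial(p) for p in partitions]
combined = reduce(MeanAgg.merge, partials, MeanAgg())
print(combined.mean)
```

Averaging the per-partition means directly would give the wrong answer when partitions differ in size, which is why the count travels with the sum.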
30. Make It Sparse…
also, take a moment to check this out…
(and related work on sparse Cholesky, etc.)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
31. Sparse Matrix Collection
for those times when you really, really need
a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
32. A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… what impact does that
have on sampling rates?
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning apps at scale
33. Just Enough Mathematics?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra: e.g., calculating algorithms for large-scale apps efficiently
graph theory: e.g., representation of problems in a calculable language
abstract algebra: e.g., probabilistic data structures in streaming analytics
topology: e.g., determining the underlying structure of the data
operations research: e.g., techniques for optimization … in other words, ROI
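One way to see the abstract algebra entry in practice is a Count-Min sketch, a probabilistic data structure whose merge operation is associative and commutative with the empty sketch as identity (a commutative monoid), so partial sketches from different stream partitions can be combined in any order. The code below is a small sketch under assumed parameters (`width=64`, `depth=4`, SHA-256 for hashing); the class and helper names are illustrative only.

```python
import hashlib
from typing import Iterable, List


class CountMinSketch:
    """Approximate counter: estimates never undercount, may overcount."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # one deterministic hash per row, derived by salting with the row id
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item: str) -> int:
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        """Monoid operation: cell-wise addition of two same-shaped sketches."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


def sketch_of(items: Iterable[str]) -> CountMinSketch:
    sketch = CountMinSketch()
    for item in items:
        sketch.add(item)
    return sketch


# two stream partitions sketched separately, then merged
left = sketch_of(["a", "b", "a"])
right = sketch_of(["a", "c"])
merged = left.merge(right)
whole = sketch_of(["a", "b", "a", "a", "c"])
print(merged.estimate("a"))
```

Merging the two partial sketches yields exactly the same table as sketching the whole stream at once, which is what makes the structure usable in parallel and streaming settings.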
34. ADMM: a general approach for optimizing learners
Distributed Optimization and Statistical Learning
via the Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, et al., Stanford
stanford.edu/~boyd/papers/admm_distr_stats.html
“Throughout, the focus is on applications rather than theory, and a main goal is
to provide the reader with a kind of ‘toolbox’ that can be applied in many situations
to derive and implement a distributed algorithm of practical use. Though the focus
here is on parallelism, the algorithm can also be used serially, and it is interesting
to note that with no tuning, ADMM can be competitive with the best known
methods for some problems.”
“While we have emphasized applications that can be concisely explained, the
algorithm would also be a natural fit for more complicated problems in areas
like graphical models. In addition, though our focus is on statistical learning
problems, the algorithm is readily applicable in many other cases, such as in
engineering design, multi-period portfolio optimization, time series analysis,
network flow, or scheduling.”
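For a feel of how ADMM splits work across workers, here is a toy consensus example, assuming each worker i holds a single number a_i and a local objective 0.5*(x - a_i)^2. The workers must agree on a shared value z, and the minimizer of the summed objective is the mean of the a_i. The function name, the choice rho = 1.0, and the iteration count are illustrative assumptions, not taken from the paper's code.

```python
from typing import List


def consensus_admm(a: List[float], rho: float = 1.0, iterations: int = 60) -> float:
    """Global-consensus ADMM in scaled form for f_i(x) = 0.5 * (x - a_i)^2."""
    n = len(a)
    x = [0.0] * n   # local variables, one per worker
    u = [0.0] * n   # scaled dual variables, one per worker
    z = 0.0         # shared consensus variable
    for _ in range(iterations):
        # x-update (parallel across workers): closed-form minimizer of
        # 0.5*(x - a_i)^2 + (rho/2)*(x - z + u_i)^2
        x = [(a_i + rho * (z - u_i)) / (1.0 + rho) for a_i, u_i in zip(a, u)]
        # z-update: average of the local estimates plus duals
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # dual update: accumulate each worker's disagreement with consensus
        u = [u_i + x_i - z for u_i, x_i in zip(u, x)]
    return z


z_star = consensus_admm([1.0, 2.0, 6.0])
print(z_star)
```

Only the z-update needs communication between workers; the x- and u-updates are local, which is the property that makes the method attractive for distributed learners.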
36. Enterprise Data Workflows
middleware for Big Data applications is evolving,
with commercial examples that include:
Cascading, Lingual, Pattern, etc. (Concurrent)
ParAccel Big Data Analytics Platform (Actian)
Anaconda supporting IPython Notebook, Pandas, Augustus, etc. (Continuum Analytics)
diagram: data sources, ETL, data prep, predictive model, end uses
37. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
diagram: data sources, ETL, data prep, predictive model, end uses
ANSI SQL for ETL
38. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
diagram: data sources, ETL, data prep, predictive model, end uses
J2EE for business logic
39. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
diagram: data sources, ETL, data prep, predictive model, end uses
SAS for predictive models
40. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
diagram: data sources, ETL, data prep, predictive model, end uses
SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
41. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
diagram: data sources, ETL, data prep, predictive model, end uses
J2EE for business logic: most of the project costs…
42. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
diagram: data sources, ETL, data prep, predictive model, end uses
source taps for Cassandra, JDBC, Splunk, etc.
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
cascading.org
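The “one connected DAG” idea can be sketched generically: once every step and its dependencies are declared in one graph, a planner can order the steps and reject invalid workflows before anything runs. The snippet below is not Cascading's planner; it is a small illustration using Python's standard `graphlib`, and the step names and the helpers `plan` and `is_valid` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# each step maps to the set of steps it depends on (hypothetical workflow)
workflow: Dict[str, Set[str]] = {
    "etl": {"data sources"},
    "data prep": {"etl"},
    "predictive model": {"data prep"},
    "end uses": {"predictive model"},
}


def plan(dag: Dict[str, Set[str]]) -> List[str]:
    """Return an execution order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def is_valid(dag: Dict[str, Set[str]]) -> bool:
    """A workflow is plannable only if its dependency graph has no cycle."""
    try:
        plan(dag)
        return True
    except CycleError:
        return False


order = plan(workflow)
print(order)
```

Because the whole workflow is visible in one structure, checks such as cycle detection happen at planning time rather than hours into a cluster job.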
43. a compiler sees it all…
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
diagram: data sources, ETL, data prep, predictive model, end uses
source taps for Cassandra, JDBC, Splunk, etc.
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
// define the flow: named source taps and a sink tap
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
// hand the SQL statement to the Lingual planner
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
44. a compiler sees it all…
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
diagram: data sources, ETL, data prep, predictive model, end uses
source taps for Cassandra, JDBC, Splunk, etc.
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
// define the flow: an input tap and a sink tap for the classified results
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
// hand the PMML model file to the Pattern planner
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );
45. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about the relational model
46. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading –
used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
domain-specific languages (DSLs) in JVM languages which
emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
47. Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
48. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
49. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
flow diagram: Document Collection, Tokenize (Regex token), Scrub token, HashJoin Left (RHS: Stop Word List), GroupBy token, Count, Word Count; stages marked M and R
data is represented as flows of tuples
operations in the flows bring functional
programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
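The “plumbing” metaphor above can be approximated in a few lines: each pipe is a function that consumes a flow of tuples and yields a new flow, and pipes are connected by composition. This is an illustrative Python sketch, not Cascading code; the function names, the sample documents, and the stop word list are all made up.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, Set, Tuple

DocTuple = Tuple[str, str]    # (doc_id, text)
TokenTuple = Tuple[str, str]  # (doc_id, token)


def tokenize(docs: Iterable[DocTuple]) -> Iterator[TokenTuple]:
    """Split each (doc_id, text) tuple into (doc_id, token) tuples."""
    for doc_id, text in docs:
        for token in re.split(r"[ \[\]\(\),.]", text):
            yield doc_id, token


def scrub(tuples: Iterable[TokenTuple]) -> Iterator[TokenTuple]:
    """Normalize tokens and drop the empty ones."""
    for doc_id, token in tuples:
        token = token.strip().lower()
        if token:
            yield doc_id, token


def drop_stop_words(tuples: Iterable[TokenTuple],
                    stop_words: Set[str]) -> Iterator[TokenTuple]:
    """Keep only tuples whose token is absent from the stop word list."""
    for doc_id, token in tuples:
        if token not in stop_words:
            yield doc_id, token


def count_by_token(tuples: Iterable[TokenTuple]) -> Dict[str, int]:
    """Group the flow by token and count each group."""
    return dict(Counter(token for _, token in tuples))


docs = [
    ("doc01", "A rain shadow is a dry area."),
    ("doc02", "Rain falls on the windward side."),
]
stop_words = {"a", "is", "on", "the"}
counts = count_by_token(drop_stop_words(scrub(tokenize(docs)), stop_words))
print(counts)
```

Each stage is lazy and free of side effects, so stages can be reordered, tested in isolation, or swapped out without touching their neighbors.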
50. Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
flow diagram: Document Collection, Tokenize (Regex token), Scrub token, HashJoin Left (RHS: Stop Word List), GroupBy token, Count, Word Count; stages marked M and R
Literate Programming
Don Knuth
literateprogramming.com
51. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
52. The Ubiquitous Word Count
Definition:
count how often each word appears
in a collection of text documents
this simple program provides an excellent test case
for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
a distributed computing framework that runs Word Count
efficiently in parallel at scale can handle much larger
and more interesting compute problems
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");
void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
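The map and reduce pseudocode for Word Count can be tried out locally with a small single-process simulation, shown below as a sketch. The shuffle step, which a framework such as Hadoop performs between the two phases, is imitated here with a dictionary; splitting on whitespace stands in for `segment(text)`, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, group: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum the partial counts collected for one word."""
    return word, sum(group)


def run_word_count(docs: Dict[str, str]) -> Dict[str, int]:
    # shuffle: group all values emitted by the mappers under their key
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for doc_id, text in docs.items():
        for word, one in map_fn(doc_id, text):
            shuffled[word].append(one)
    # each key's group could be reduced on a different machine
    return dict(reduce_fn(word, group) for word, group in shuffled.items())


result = run_word_count({"d1": "to be or not to be", "d2": "to do"})
print(result)
```

Since neither function touches shared state, the mappers can run on separate documents and the reducers on separate keys, which is where the parallelism comes from.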
56. A Thought Exercise
Consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
They will be optimizing supply chain, optimizing fuel
costs, automating data feedback loops integrated
into their equipment…
Operations Research –
crunching amazing amounts of data
$50B company, in a $250B market segment
Upcoming: tractors as drones –
guided by complex, distributed data apps
58. Two Avenues to the App Layer…
axes: scale ➞, complexity ➞
Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
60. Q3 1997: inflection point
four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
this effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this period
61. Circa 1996: pre- inflection point
diagram labels: Customers, Web App, transactions, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, Product, strategy, Engineering, requirements, optimized code
62. Circa 1996: pre- inflection point
“throw it over the wall”
diagram labels: Customers, Web App, transactions, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, Product, strategy, Engineering, requirements, optimized code
63. Circa 2001: post- big ecommerce successes
diagram labels: Customers, Web Apps, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers, Middleware, servlets, models, Stakeholder, Product, Engineering, UX
64. Circa 2001: post- big ecommerce successes
“data products”
diagram labels: Customers, Web Apps, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers, Middleware, servlets, models, Stakeholder, Product, Engineering, UX
65. Circa 2013: clusters everywhere
diagram: Use Cases Across Topologies
labels: Customers; Web Apps, Mobile, etc.; transactions, content; social interactions; Log Events; History; RDBMS; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; Workflow (near time, batch); Planner; taps; services; Data Products; DW; dashboard metrics; business process; optimized capacity; discovery + modeling; s/w dev; data science; Prod; Eng; Ops
roles: Data Scientist, App Dev, Ops, Domain Expert (introduced capability, existing SDLC)
66. Circa 2013: clusters everywhere
“optimize topologies”
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
67. Primary Sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
68. Cluster Computing’s Dirty Little Secret
people like me make a good living by leveraging high ROI
apps based on clusters, and so the execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization… various notions
of “cloud” help
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”, i.e., “All your workloads are belong to us”
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
image: Google Data Center, Fox News, ~2002
69. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of disruption in topologies
could this imply? because there’s
no such thing as RAM anymore…
70. Topologies
Hadoop and other topologies arose from a need for fault-tolerant
workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
CAP diagram: C = strong consistency, A = high availability, P = partition tolerance; eventual consistency
“You can have at most two of these properties for
any shared-data system… the choice of which
feature to discard determines the nature of your
system.” – Eric Brewer, 2000 (Inktomi/YHOO)
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
72. “Return of the Borg”
consider that Google is generations ahead of
Hadoop, etc., with much improved ROI on its
data centers…
Borg serves as a kind of “secret sauce” for
data center OS, with Omega as its next
evolution:
2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
73. “Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
74. Mesos – definitions
a common substrate for cluster computing
heterogeneous assets in your data center or cloud
made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation between tasks with Linux Containers (pluggable)
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python
• web UI for inspecting cluster state
• available for Linux, Mac OS X, OpenSolaris
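To illustrate what multi-resource scheduling means in the list above, the sketch below implements a simplified form of Dominant Resource Fairness (DRF), the fairness policy associated with the Mesos allocator. It is an assumption-laden toy rather than Mesos code: task sizes are fixed per user, there is a single pool, and the function name and the numbers (9 CPUs, 18 GB; users "A" and "B") are chosen only for the example.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def drf_allocate(total: Tuple[int, ...],
                 demands: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Repeatedly give one task to the user with the lowest dominant share."""
    used: List[int] = [0] * len(total)
    held: Dict[str, List[int]] = {user: [0] * len(total) for user in demands}
    tasks: Dict[str, int] = {user: 0 for user in demands}
    active: List[str] = sorted(demands)

    def dominant_share(user: str) -> Fraction:
        # a user's dominant share is their largest fractional use of any resource
        return max(Fraction(h, t) for h, t in zip(held[user], total))

    while active:
        user = min(active, key=dominant_share)
        demand = demands[user]
        if all(u + d <= t for u, d, t in zip(used, demand, total)):
            for i, d in enumerate(demand):
                used[i] += d
                held[user][i] += d
            tasks[user] += 1
        else:
            # this user's next task no longer fits; stop offering to them
            active.remove(user)
    return tasks


# resources are (CPUs, GB of memory); each user runs tasks of a fixed size
allocation = drf_allocate(
    total=(9, 18),
    demands={"A": (1, 4), "B": (3, 1)},
)
print(allocation)
```

User A is memory-bound and user B is CPU-bound, so equalizing dominant shares lets both make progress instead of one resource dimension deciding everything.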
75. Mesos – simplifies app development
CHRONOS SPARK HADOOP DPARK MPI
JVM (JAVA, SCALA, CLOJURE, JRUBY)
MESOS
PYTHON C++
76. Mesos – data center OS stack
HADOOP STORM CHRONOS RAILS JBOSS
TELEMETRY
Kernel
OS
Apps
MESOS
CAPACITY PLANNING, GUI, SECURITY, SMARTER SCHEDULING
78. Prior Practice: Dedicated Servers
DATACENTER
• low utilization rates
• longer time to ramp up new services
79. Prior Practice: Virtualization
DATACENTER PROVISIONED VMS
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
80. Prior Practice: Static Partitioning
DATACENTER STATIC PARTITIONING
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
• static partitioning limits elasticity
81. Mesos: One Large Pool Of Resources
DATACENTER, MESOS
“We wanted people to be able to program
for the data center just like they program
for their laptop.”
Ben Hindman
82. What are the costs of Virtualization?
benchmark type     OpenVZ improvement
mixed workloads    210%-300%
LAMP (related)     38%-200%
I/O throughput     200%-500%
response time      order of magnitude, more pronounced at higher loads
83. What are the costs of Single Tenancy?
charts: CPU load (0%-100%) over time t for Rails, Memcached, and Hadoop, shown separately and as combined CPU load (Rails, Memcached, Hadoop)
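The cost of single tenancy can be put in rough numbers. In the sketch below, dedicated clusters must each be sized for their own peak, while a shared pool only needs to cover the peak of the combined load. The load series are invented for illustration (units are arbitrary, e.g. cores in use per time slot) and are not read off the charts.

```python
from typing import Dict, List

# hypothetical load per time slot for three workloads with different peaks
loads: Dict[str, List[int]] = {
    "rails":     [30, 70, 90, 60, 20, 10],
    "memcached": [20, 50, 60, 40, 20, 10],
    "hadoop":    [80, 20, 10, 30, 90, 100],
}

# single tenancy: one cluster per workload, each sized for its own peak
dedicated_capacity = sum(max(series) for series in loads.values())

# shared pool: one cluster sized for the peak of the combined load
combined = [sum(slot) for slot in zip(*loads.values())]
shared_capacity = max(combined)

mean_load = sum(combined) / len(combined)
util_dedicated = mean_load / dedicated_capacity
util_shared = mean_load / shared_capacity

print(dedicated_capacity, shared_capacity)
print(round(util_dedicated, 3), round(util_shared, 3))
```

With these made-up numbers the shared pool needs 160 units instead of 250, because the batch workload peaks when the request-driven workloads are quiet.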
84. Compelling arguments for Data Center OS
• obviates the need for VMs (licensing, adios VMware)
• provides OS-level building blocks for developing new
distributed frameworks (learning curve, adios Hadoop)
• removes significant VM overhead (performance)
• requires less h/w to buy (CapEx), power and fix (OpEx)
• implies fewer VMs, thus less Ops overhead (staff)
• removes the complexity of Chef/Puppet (staff)
• allows higher utilization rates (ROI)
• reduces latency for data updates (OLTP + OLAP on same server)
• reshapes cluster resources dynamically (100s of ms vs. minutes)
• runs dev/test clusters on same h/w as production (flexibility)
• evaluates multiple versions without more h/w (vendor lock-in)
85. Opposite Ends of the Spectrum, One Substrate
Built-in /
bare metal
Hypervisors
Solaris Zones
Linux CGroups
86. Opposite Ends of the Spectrum, One Substrate
Request /
Response
Batch
87. Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure –
it’s how we build all our new services and is critical for Twitter’s
continued success at scale. It’s one of the primary keys to our
data center efficiency.”
Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think
about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of
servers across data centers efficiently
• reduces the time between prototyping and launching
88. Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel
more so than anyone has ever done before… a smaller number
of engineers can have higher impact through automation on
Mesos.”
Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance engineering strategy of building small teams
that can move fast
• key to letting engineers make the most of AWS-based
infrastructure beyond just Hadoop
• allowed company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.
90. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. the evolution of cluster computing
SUMMARY…
DSSG, 2013-08-12
91. Circa 2013: clusters everywhere – Four-Part Harmony
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
92. Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
93. Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
94. Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing
middleware, e.g., Cascading
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
95. Circa 2013: clusters everywhere – Four-Part Harmony
4. Data Center OS, e.g., Mesos
diagram: Use Cases Across Topologies (Workflow, Planner, Cluster Scheduler, RDBMS, In-Memory Data Grid, Hadoop, etc.; roles: Data Scientist, App Dev, Ops, Domain Expert)
96. Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events,
conference summaries, etc.:
liber118.com/pxn/