Guest lecture 2013-08-27 at General Assembly in SF for the Data Science program taught by Jacob Bollinger and Thomson Nguyen https://generalassemb.ly/education/data-science/san-francisco
Many thanks to Thomson, Jacob, and the participants in the course. Excellent Q&A!
Received a bottle o' Cardhu (my fave Scotch) in payment for lecture, and since it's Burning Man Week, the city was emptied so we had enough to share with the class :)
Evidence:
https://plus.google.com/u/0/110794698656267747127/posts/GvjhhQ99CTs
An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12) http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases.
http://www.eventbrite.com/event/7476758185
The document discusses challenges in analytics for big data. It notes that big data refers to data that exceeds the capabilities of conventional algorithms and techniques to derive useful value. Some key challenges discussed include handling the large volume, high velocity, and variety of data types from different sources. Additional challenges include scalability for hierarchical and temporal data, representing uncertainty, and making the results understandable to users. The document advocates for distributed analytics from the edge to the cloud to help address issues of scale.
The Internet of Things (IoT) is a vision for a ubiquitous society wherein people and “Things” are connected in an immersively networked computing environment, with the connected “Things” providing utility to people, enterprises, and their digital shadows through intelligent social and commercial services. However, translating this idea into reality has been a work in progress for close to two decades, mostly due to assumptions that favour a “Things”-centric rather than a “Human”-centric approach, coupled with the evolution and deployment ecosystem of IoT technologies.
Estimates of the spread and economic impact of IoT over the next few years run to 50 billion or more connected “Things,” with a market exceeding $350 billion through smarter cities and infrastructure, intelligent appliances, and healthier lifestyles. While many of these potential benefits are real and achievable, the road to accomplishing them may need a rethink.
In the last few years, there has been a realization that an effective IoT architecture (particularly for emerging nations with limited technology penetration at the national scale) that is both affordable and sustainable should be based on tangible technology advances available now, ubiquitous capabilities of the present and near future, and practical application scenarios of social and entrepreneurial value. Hence, there is renewed interest in rethinking the above assumptions, and this exercise has led to a more plausible set of scenarios wherein humans, along with data, communication, and devices, play key roles.
This presentation attempts to disaggregate these core problems and offers a trajectory, with a set of design paradigms, for a renewed IoT ecosystem.
Data Science in 2016: Moving up, by Paco Nathan at Big Data Spain 2015 - Big Data Spain
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
The document discusses big data basics, infrastructure, challenges, and use cases. It defines big data as large volumes of structured, semi-structured, and unstructured data that is difficult to process using traditional databases and software. Common big data infrastructure includes clustered network attached storage, object storage, Hadoop, and data appliances like HP Vertica and Teradata Aster. Challenges discussed include log management, data integrity, backup management, and database management in the big data era. Potential big data use cases include modeling risk, customer churn analysis, and recommendation engines.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
This document provides an overview of big data by discussing its background and definitions. It describes how data has grown exponentially in recent years due to factors like the internet, cloud computing, and internet of things. Big data is defined as data that cannot be processed by traditional technologies due to its huge size, speed of growth, and variety of data types. The document outlines several common definitions of big data, including the 3Vs (volume, velocity, variety) and 4Vs (volume, variety, velocity, value) models. It aims to provide readers with a comprehensive understanding of the emerging field of big data.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
The document discusses developing an integrated framework to utilize big data for higher education institutions in Saudi Arabia. It aims to develop a framework to support decision making and improve performance in education sectors using big data. The study collected data through surveys and interviews to analyze factors affecting adoption and implementation of big data. The framework addresses issues related to adoption of big data in education.
Semantic Web Investigation within Big Data Context - Murad Daryousse
This document discusses how the semantic web can help address challenges associated with big data. It describes the 5 V's of big data: volume, variety, velocity, veracity, and value. For each V, it outlines related challenges in data acquisition, integration, and analysis. The document argues that semantic web concepts like ontologies, linked data, and reasoning can help solve problems of data heterogeneity, scale, and timeliness across different phases of the big data analysis pipeline, in order to ultimately extract value from data.
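As a small illustration of the linked-data idea, here is a minimal sketch in Python using rdflib, where two hypothetical sources describe the same entity with different vocabularies and a single SPARQL query spans both (all URIs and data below are invented for illustration):

```python
# Minimal sketch: linked data integrating two heterogeneous sources.
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SCHEMA = Namespace("https://schema.org/")
person = URIRef("http://example.org/customer/42")  # shared URI across sources

source_a = Graph()
source_a.add((person, FOAF.name, Literal("Ada Lovelace")))

source_b = Graph()
source_b.add((person, SCHEMA.name, Literal("Ada Lovelace")))
source_b.add((person, SCHEMA.email, Literal("ada@example.org")))

# Graph union: because both sources identify the entity by URI,
# merging requires no schema mapping step.
merged = source_a + source_b

# Property-path alternation queries across both vocabularies at once.
q = """
SELECT ?name ?email WHERE {
  ?p foaf:name|schema:name ?name .
  OPTIONAL { ?p schema:email ?email }
}
"""
for name, email in merged.query(q, initNs={"foaf": FOAF, "schema": SCHEMA}):
    print(name, email)
```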
The current challenges and opportunities of big data and analytics in emergen... - IBM Analytics
Big data and analytics present many possibilities for emergency management specialists and first responders. Some of these benefits include pinpointing vulnerabilities, bringing in the right resources and maximizing existing resources to pave the way to adoption. However, these opportunities are not without challenges. Emergency management experts Adam Crowe, Director, Emergency Preparedness at Virginia Commonwealth University; William Moorhead, President of All Clear Emergency Management Group; and Gary Nestler, Associate Partner and Global Leader, Emergency Management solutions at IBM discuss these challenges and opportunities in this slideshare—which is intended to help disaster management stakeholders achieve the most accurate situational awareness using analytics.
Discover analytics solutions for emergency management http://ibm.co/emergencymgmt
This document discusses challenges and outlooks related to big data. It begins with an introduction describing how big data is being collected and analyzed in various fields such as science, education, healthcare, urban planning, and more. It then outlines the key phases in big data analysis: data acquisition and recording, information extraction and cleaning, data integration and representation, query processing and analysis, and result interpretation. For each phase, it discusses challenges and how existing techniques can be applied or extended to address big data issues. Some of the major challenges discussed are data scale, heterogeneity, lack of structure, privacy, timeliness, provenance, and visualization across the entire big data analysis pipeline.
This document discusses big data and its challenges related to the Internet of Things (IoT). It first defines big data and explains how the aggregation of data from many IoT systems can lead to big data. It then discusses some key challenges of big data, including issues with data volume, velocity, variety, and veracity. Specific challenges for big data from IoT systems are also reviewed, such as authentication, security, and uncertainty of data. Finally, the document outlines some potential solutions to big data challenges, such as using MapReduce for heterogeneous data, data cleaning techniques for inconsistencies, and cloud-based security platforms for IoT devices.
Roger Hoerl SAY Award presentation 2013 - Roger Hoerl
This document discusses how statistical engineering principles can help address challenges with "Big Data" projects. It argues that simply having powerful algorithms and large datasets does not guarantee good models or results. The leadership challenge for statisticians is to ensure Big Data projects are built on sound modeling foundations rather than hype. Statistical engineering principles like understanding data quality, using sequential approaches, and integrating subject matter knowledge can help improve the success of Big Data analyses and provide the statistical profession an opportunity for leadership in this area. Statistical engineering provides a framework to structure Big Data projects and incorporate fundamentals of good science that are sometimes overlooked.
In this presentation, I have talked about Big Data and its importance in brief. I have included the very basics of Data Science and its importance in the present day, through a case study. You can also get an idea about who a data scientist is and what all tasks he performs. A few applications of data science have been illustrated in the end.
A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
The presentation is about the career path in the field of Data Science. Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
The document provides an overview of data science applications and use cases. It defines data science as using computer science, statistics, machine learning and other techniques to analyze data and create data products to help businesses make better decisions. It discusses big data challenges, the differences between data science and software engineering, and key areas of data science competence including data analytics, engineering, domain expertise and data management. Finally, it outlines several common data science applications and use cases such as recommender systems, credit scoring, dynamic pricing, customer churn analysis and fraud detection with examples of how each works and real world cases.
This document provides an introduction to business analytics. It discusses how analytics has evolved from simple number crunching to a competitive strategy that is driving innovation. It explains the importance of analytics in decision making and its impact on organizational performance. Examples are given of companies that use analytics successfully, like Amazon's recommender system. The document outlines the data-driven decision making process and how analytics is used across organizations to solve problems and make decisions at different levels from process improvement to competitive strategy.
The document discusses big data analytics, including its characteristics, tools, and applications. It defines big data analytics as the application of advanced analytics techniques to large datasets. Big data is characterized by its volume, variety, and velocity. New tools and methods are needed to store, manage, and analyze big data. The document reviews different big data storage, processing, and analytics tools and methods that can be applied in decision making.
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
This document provides an overview of data science. It defines data as facts such as numbers, words, measurements, and descriptions. Data science involves developing methods to analyze and extract useful insights from both structured and unstructured data. While data mining focuses on analyzing large datasets, data science covers the entire data lifecycle. There is a growing demand for data scientists as every industry relies on data. Data scientists use various statistical techniques to find patterns in data and gain knowledge. Netflix is used as a case study to show how it has become a data-driven business that uses data science to power recommendations and improve the customer experience.
This document discusses uncertainty in big data analytics. It begins by providing background on big data, defining the common "5 V's" characteristics of big data - volume, variety, velocity, veracity, and value. It then discusses uncertainty, which exists in big data due to noise, incompleteness, and inconsistency in data. The document surveys techniques for big data analytics and how uncertainty impacts machine learning, natural language processing, and other artificial intelligence approaches. It identifies challenges that uncertainty presents and strategies for mitigating uncertainty in big data analytics.
This document discusses big data and its characteristics. It provides examples of how companies like Walmart and Facebook handle large amounts of data. It defines big data and describes the types of data: structured, unstructured, and semi-structured. The key characteristics of big data are identified as volume, variety, velocity, and variability. The document concludes that with billions more people gaining internet access, the amount of data will continue growing exponentially and we have only begun to see the potential of big data.
The profile of the management (data) scientist: Potential scenarios and skill... - Juan Mateos-Garcia
Big and social media data opens up new scenarios and opportunities for management research (such as using internal communication data to map knowledge networks inside firms, or using web data to study firm capabilities and strategies). This presentation, given at the British Academy of Management 2014 conference, proposes a typology of such scenarios, describes the skills required to exploit them, and considers implications for the education and training of management researchers.
The document discusses the paradigm shift in social science research enabled by big data. Key points:
- Advances in data collection technologies and analytics tools now allow researchers to study social phenomena at an unprecedented scale, depth, and scope. This represents a potential scientific paradigm shift toward computational social science.
- Factors driving this shift include the massive growth of digital data from various sources, reduced costs of data collection, and new capabilities for large-scale empirical research.
- The new approaches enabled by big data help address traditional tradeoffs in research between generalizability, control, and realism. Large, unobtrusively collected data sets allow for more realistic and controlled studies of real-world phenomena.
On Digital Markets, Data, and Concentric Diversification - Bernhard Rieder
This document discusses how large tech companies like Google and Facebook have expanded from their original businesses through a strategy of concentric diversification. It argues that their accumulation of large data assets and algorithmic capabilities allows them to computerize new domains. For example, Google uses its knowledge bases and machine learning to expand from search into areas like self-driving cars. Facebook leverages its social graph and identity resolution to enter new ad tech businesses. The document analyzes how these companies' technological systems grow more valuable as their assets transfer to new sectors, creating economies of scale that affect market dynamics and relationships between firms.
Opportunities and methodological challenges of Big Data for official statist... - Piet J.H. Daas
1) The document discusses opportunities and challenges of using Big Data for official statistics. It describes Big Data as data that is difficult to collect, store, or process using conventional statistical systems due to issues of volume, velocity, structure, or variety.
2) The author outlines their experiences at Statistics Netherlands using various Big Data sources like traffic sensor data, mobile phone data, and social media data. They discuss methodological challenges in accessing and analyzing large volumes of data, dealing with noisy and unstructured data, and addressing issues of selectivity.
3) The document emphasizes the need for new skills like data science, high performance computing, and people with open and pragmatic mindsets to work with Big Data. It also addresses privacy
Toward a System Building Agenda for Data Integration (and Data Science) - juliennehar
Toward a System Building Agenda for Data Integration (and Data Science)
AnHai Doan, Pradap Konda, Paul Suganthan G.C., Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Han Li, Philip Martinkus, Sidharth Mudgal, Erik Paulson, Haojun Zhang
University of Wisconsin-Madison
Abstract
We argue that the data integration (DI) community should devote far more effort to building systems, in order to truly advance the field. We discuss the limitations of current DI systems, and point out that there is already an existing popular DI “system” out there, which is PyData, the open-source ecosystem of 138,000+ interoperable Python packages. We argue that rather than building isolated monolithic DI systems, we should consider extending this PyData “system”, by developing more Python packages that solve DI problems for the users of PyData. We discuss how extending PyData enables us to pursue an integrated agenda of research, system development, education, and outreach in DI, which in turn can position our community to become a key player in data science. Finally, we discuss ongoing work at Wisconsin, which suggests that this agenda is highly promising and raises many interesting challenges.
1 Introduction
In this paper we focus on data integration (DI), broadly interpreted as covering all major data preparation steps such as data extraction, exploration, profiling, cleaning, matching, and merging [10]. This topic is also known as data wrangling, munging, curation, unification, fusion, preparation, and more. Over the past few decades, DI has received much attention (e.g., [37, 29, 31, 20, 34, 33, 6, 17, 39, 22, 23, 5, 8, 36, 15, 35, 4, 25, 38, 26, 32, 19, 2, 12, 11, 16, 2, 3]). Today, as data science grows, DI is receiving even more attention. This is because many data science applications must first perform DI to combine the raw data from multiple sources, before analysis can be carried out to extract insights.
Yet despite all this attention, today we do not really know whether the field is making good progress. The vast majority of DI works (with the exception of efforts such as Tamr and Trifacta [36, 15]) have focused on developing algorithmic solutions. But we know very little about whether these (ever-more-complex) algorithms are indeed useful in practice. The field has also built mostly isolated system prototypes, which are hard to use and combine, and are often not powerful enough for real-world applications. This makes it difficult to decide what to teach in DI classes. Teaching complex DI algorithms and asking students to do projects using our prototype systems can train them well for doing DI research, but are not likely to train them well for solving real-world DI problems in later jobs. Similarly, outreach to real users (e.g., domain scientists) is difficult. Given that we have ...
Applications of Big Data Analytics in Businesses - T.S. Lim
The document discusses big data and big data analytics. It begins with definitions of big data from various sources that emphasize the large volumes of structured and unstructured data. It then discusses key aspects of big data including the three Vs of volume, variety, and velocity. The document also provides examples of big data applications in various industries. It explains common analytical methods used in big data including linear regression, decision trees, and neural networks. Finally, it discusses popular tools and frameworks for big data analytics.
McKinsey Global Institute - Big data: The next frontier for innova....docx - andreecapon
McKinsey Global Institute
Big data: The next frontier for innovation, competition, and productivity
2. Big data techniques and technologies
A wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields including statistics, computer science, applied mathematics, and economics. This means that an organization that intends to derive value from big data has to adopt a flexible, multidisciplinary approach. Some techniques and technologies were developed in a world with access to far smaller volumes and variety in data, but have been successfully adapted so that they are applicable to very large sets of more diverse data. Others have been developed more recently, specifically to capture value from big data. Some were developed by academics and others by companies, especially those with online business models predicated on analyzing big data.
This report concentrates on documenting the potential value that leveraging big data can create. It is not a detailed instruction manual on how to capture value, a task that requires highly specific customization to an organization’s context, strategy, and capabilities. However, we wanted to note some of the main techniques and technologies that can be applied to harness big data to clarify the way some of the levers for the use of big data that we describe might work. These are not comprehensive lists—the story of big data is still being written; new methods and tools continue to be developed to solve new problems. To help interested readers find a particular technique or technology easily, we have arranged these lists alphabetically. Where we have used bold typefaces, we are illustrating the multiple interconnections between techniques and technologies. We also provide a brief selection of illustrative examples of visualization, a key tool for understanding very large-scale data and complex analyses in order to make better decisions.
TECHNIQUES FOR ANALYZING BIG DATA
There are many techniques that draw on disciplines such as statistics and computer science (particularly machine learning) that can be used to analyze datasets. In this section, we provide a list of some categories of techniques applicable across a range of industries. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need
to analyze new combinations of data. We note that not all of these techniques strictly require the use of big data—some of them can be applied effectively to smaller datasets (e.g., A/B testing, regression analysis). However, all of the techniques we list here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.
A/B testing. A technique in which a control group is compa ...
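The excerpt above names A/B testing as a technique that applies to datasets of any size. As a concrete illustration, here is a minimal sketch of a two-proportion z-test in Python using statsmodels; the conversion counts are invented for the example:

```python
# Minimal A/B test sketch: do variants A and B convert at different rates?
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 466]   # invented: successes in control (A) and treatment (B)
visitors = [10000, 10000]  # sample size of each group

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A p-value below the usual 0.05 threshold suggests the difference
# in conversion rates is unlikely to be due to chance alone.
```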
Big data presents challenges at the data, model, and system levels. At the data level, issues include heterogeneous sources, missing/uncertain values, and privacy/errors. At the model level, generating global models from local patterns is difficult. At the system level, linking complex relationships between data sources and handling growth is challenging. Addressing these issues requires high-performance computing, algorithms to analyze distributed data and models, and carefully designed systems to form useful patterns from unstructured data and identify trends over time. Big data technologies may help provide more accurate social sensing and understanding.
The document provides an overview of data science and what it entails. It discusses the hype around big data and data science, and how data science has evolved due to improvements in technology that allow for large-scale data processing. It defines data science as a process that involves collecting, cleaning, analyzing and extracting meaningful insights from data. Data scientists come from a variety of academic backgrounds and work in both industry and academia developing solutions to real-world problems using data-driven approaches.
This document discusses the need for a new paradigm in big data analytics using algorithms. It begins by describing the limitations of traditional analytics approaches like statistical analysis, data mining, visualization and business intelligence tools when applied to big data. These approaches are query-based and labor intensive. Emerging big data tools like Hadoop and in-memory databases help with storage and queries but do not provide automated insights. The document argues that the new paradigm should focus on algorithms that can automatically surface insights from data in seconds, replacing the need for data analysts to manually query databases. This represents a shift from humans digging for insights to algorithms surfacing insights for humans to evaluate.
Introduction to Data Analytics and data analytics life cycle - Dr. Radhey Shyam
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It also describes different types of data like structured, semi-structured and unstructured data. The document then introduces big data platforms and tools like Hadoop, Spark and Cassandra. Finally, it discusses the need for data analytics in business, including enabling better decision making and improving efficiency.
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa... - NOVA DATASCIENCE
The document provides a brief history of data science as an academic field, outlines current data science educational programs, and predicts future trends. It notes that data science emerged from a series of academic concepts in the 1960s and gained popularity in the 2000s. Currently, there are three main educational paradigms for data science: business analytics, data science, and data engineering. The document also discusses best practices in data science education and predicts that future trends may include increased specialization, collaboration between education and industry, and evolving skills demands.
Big Data: Are you ready for it? Can you handle it? - ScaleFocus
Big data presents both opportunities and challenges for companies. It provides a competitive advantage but organizing, analyzing, and drawing accurate conclusions from vast amounts of unsorted data can be difficult. Companies must critically examine their data to avoid making miscalculations from biases, gaps, or false senses of reliability. Technical solutions like Hadoop can help by supporting flexible handling of multiple data sources at low cost for tasks like data staging, processing, and archiving. However, big data requires experienced teams to ask the right questions and leverage these tools to accomplish business goals, rather than viewing them as guarantees of success. Companies must assess their readiness by considering resources, change management, success criteria, and partner selection.
This document proposes an Impact Monitoring Framework to measure the impact of open data. It combines the principles of Social Return on Investment (SROI) with existing open data impact literature. The framework outlines a theory of change model with inputs, outputs, outcomes and impacts. Inputs refer to resources used to publish data. Outputs are deliverables like open data portals. Outcomes are re-use activities by third parties. Impacts adjust outcomes to estimate effects caused by open data alone. The framework could help focus resources on high-impact activities and improve new initiatives. Empirical testing of real open data projects is needed to validate the framework.
This document discusses data science career paths and the role of a data scientist. It defines data science as the scientific process of transforming data into insights to make better decisions. Data scientists are skilled at statistics, software engineering, machine learning, and communicating findings. The document outlines common data science career paths, including roles in fraud detection and social media analytics. It also lists important skills for data scientists such as data mining, machine learning, statistics, visualization, programming, and working with big data. Finally, it provides an example of tasks a data scientist might complete in a typical day.
Data and Analytics Career Paths, Presented at IEEE LYC'19.
About Speaker:
Ahmed Amr is a Data/Analytics Engineer at Rubikal, where he leads, develops, and creates daily data/analytics operations, including data ingestion, data streaming, data warehousing, and analytical dashboards. Ahmed graduated from the Computer Engineering Department, Alexandria University, and is currently pursuing his MSc degree in Computer Science at AAST. Professionally, Ahmed has worked with Egyptian/US startups such as Badr, Incorta, and WhoKnows to develop their data/analytics projects. Academically, Ahmed has worked as a Teaching Assistant in the CS department at AAST. Ahmed helps software companies develop robust data engineering infrastructure and powerful analytical insights.
References:
1) https://www.datacamp.com/community/tutorials/data-science-industry-infographic
2) Analytics: The real-world use of big data, IBM, Executive Report
This document summarizes several research projects related to big data and social science knowledge. It discusses projects that analyzed large social media platforms like Facebook, Twitter, and Wikipedia to study information diffusion and social influences. It also discusses challenges like securing access to commercial data and ensuring replicability of findings. Examples demonstrate how big data can provide novel insights but are limited by the objects studied and incomplete representation of populations. The document discusses debates around the implications of big data for privacy, prediction, exclusion, and manipulation. It argues that knowledge depends on how research technologies advance knowledge within ethical and legal frameworks.
Similar to Data Center Computing for Data Science: an evolution of machines, middleware, math, and Mesos
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
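As a concrete illustration of this exception-routing pattern, here is a minimal sketch in Python assuming a scikit-learn text classifier; the training examples and the confidence threshold are invented, and this is not the implementation from the talk:

```python
# Minimal HITL sketch: confident predictions flow through automatically,
# low-confidence cases get referred to a human expert.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["great talk", "awful slides", "loved the demo", "boring intro"]
labels = [1, 0, 1, 0]  # toy sentiment labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labeled_texts, labels)

THRESHOLD = 0.8  # hypothetical confidence cutoff; tune per use case

def route(text):
    """Return (route, label): auto-label or defer to a human expert."""
    proba = model.predict_proba([text])[0]
    if proba.max() >= THRESHOLD:
        return "auto", int(np.argmax(proba))
    # The expert's eventual label gets added to the training set,
    # improving the next iteration of the model.
    return "refer_to_expert", None

print(route("loved the slides"))
```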
Human-in-the-loop: a design pattern for managing teams that leverage ML - Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML - Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI - Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
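As a rough illustration of people and machines sharing a notebook as a common document, here is a minimal sketch using nbformat directly (which nbtransom builds atop); the file names and the "label" metadata key are hypothetical, not nbtransom's actual API:

```python
# Minimal sketch: programmatically annotate a notebook, treating it as
# one part configuration file, one part structured log.
import nbformat

nb = nbformat.read("example.ipynb", as_version=4)  # hypothetical file

for cell in nb.cells:
    if cell.cell_type == "markdown" and "TODO" in cell.source:
        # Shared annotation that both machines and people can read/write.
        cell.metadata["label"] = "needs_expert_review"

# Append a cell recording the pending human review step.
nb.cells.append(nbformat.v4.new_markdown_cell("Reviewed by: (pending)"))
nbformat.write(nb, "example_annotated.ipynb")
```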
Humans in the loop: AI in open source and industry - Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `spaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
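For a sense of what keyphrase extraction looks like in practice, here is a minimal sketch using pytextrank's current spaCy-pipeline API (which postdates, and differs from, the 2017-era interface described above):

```python
# Minimal sketch: ranked keyphrase extraction with pytextrank + spaCy.
import spacy
import pytextrank  # noqa: F401 -- importing registers the "textrank" component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Mihalcea and Tarau showed that graph-based ranking "
          "extracts useful keyphrases from natural language texts.")

# Phrases come back sorted by TextRank score, highest first.
for phrase in doc._.phrases[:5]:
    print(f"{phrase.rank:.4f}  {phrase.text}")
```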
Use of standards and related issues in predictive analytics – Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Jupyter for Education: Beyond Gutenberg and Erasmus – Paco Nathan
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
GalvanizeU Seattle: Eleven Almost-Truisms About Data – Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning – Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. It shows how to surface data insights from the developer email forums for just about any Apache open source project, leveraging advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, and then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
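As a toy illustration (not the Exsto code), training Word2Vec on tokenized email bodies with gensim might look like the sketch below; `tokenized_messages` is a hypothetical list of token lists parsed from the archive, and the parameters assume gensim 4.x:

    from gensim.models import Word2Vec

    tokenized_messages = [
        ["spark", "streaming", "micro", "batch"],
        ["graphx", "pagerank", "community"],
        # ... one token list per email body
    ]

    model = Word2Vec(
        sentences=tokenized_messages,
        vector_size=100,  # embedding dimensionality
        window=5,
        min_count=1,
        workers=2,
    )

    # nearest neighbors in the learned vector space
    print(model.wv.most_similar("spark", topn=3))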
GraphX: Graph analytics for insights about developer communities – Paco Nathan
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
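The talk computed these metrics with the GraphX API on Spark; purely as a single-machine illustration (an assumption, not the talk's code), the same measures can be sketched with networkx over hypothetical "A replied to B" edges:

    import networkx as nx

    # hypothetical reply edges from an email archive
    replies = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
               ("dave", "alice"), ("dave", "bob")]

    G = nx.DiGraph()
    G.add_edges_from(replies)

    pagerank = nx.pagerank(G, alpha=0.85)   # top contributors
    in_degree = dict(G.in_degree())         # who receives the most replies
    n_scc = nx.number_strongly_connected_components(G)

    print(sorted(pagerank.items(), key=lambda kv: -kv[1]))
    print(in_degree, n_scc)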
Apache Spark and the Emerging Technology Landscape for Big Data – Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
QCon São Paulo: Real-Time Analytics with Spark Streaming – Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
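As a minimal sketch of the micro-batch model described above (assuming the classic DStream API; the source host/port are hypothetical), a streaming word count treats the stream as a series of small batch jobs:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-wordcount")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()   # print each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()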
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More – Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
A New Year in Data Science: ML Unpaused – Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
Microservices, Containers, and Machine Learning – Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
What do a Lego brick and the XZ backdoor have in common? – Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might have in common the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the case of the XZ backdoor share much more than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate for free software and for standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she has been involved in several LibreOffice-related events, migrations, and training efforts. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager; when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (the source of her nickname, deneb_alpha).
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, as well as schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Generative AI Deep Dive: Advancing from Proof of Concept to Production – Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Encryption in Microsoft 365 – ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI – Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Communications Mining Series – Zero to Hero – Session 1 – DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack – shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Climate Impact of Software Testing at Nordic Testing Days – Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help counter climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Sustainability can be added to the quality characteristics, and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Observability Concepts EVERY Developer Should Know – DeveloperWeek Europe – Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to, plus whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble… many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Large Language Model (LLM) and its Geospatial Applications
Data Center Computing for Data Science: an evolution of machines, middleware, math, and Mesos
1. General Assembly SF, 2013-08-27:
“Data Center Computing for Data Science:
an evolution of machines, middleware, math, and Mesos”
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases
Paco Nathan @pacoid
Chief Scientist, Mesosphere
2. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
3. Statistical Thinking
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must
(diagram: Process, Variation, Data, Tools)
4. Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced the prior practices
of data modeling
because the data won’t fit on one computer anymore
5. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
6. approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks that can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to socialize the problems, knocking down silos
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making process repeatable
What is needed most?
(figure: a wall of raw event-log names – e.g., UniqueRegistration, NUI:TutorialMode, BirthdayMessage, CreateNewPet, FeedPet, ChatNow, WebsiteLogin, NUI:DressUpMode – illustrating the messy machine-generated data that needs cleanup)
7. Team Process = Needs
• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: deliver data products at scale to LOB end uses
• apps: build smarts into product features
• systems: keep infrastructure running, cost-effective
roles spanning the process: analysts, engineers, inter-disciplinary leadership
8. Team Composition = Roles
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
introduced capability: data science
leverage non-traditional pairing among roles, to complement skills and tear down silos
10. Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
11. Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that anti-pattern leads to big teams, low ROI
(diagram: complexity vs. scale)
ultimately, the challenge is about managing learning curves within a social context
12. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
13. Business Disruption through Data
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
@Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the
entire Global 1000 on notice over the next decade…
data as the major force… mostly through apps –
verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
@XLDB, 2012:
complex analytics workloads are now displacing SQL
as the basis for Enterprise apps
14. Data Categories
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
let’s now add other useful distinctions:
• Open Data
• Curated Metadata
• A/D conversion for sensors (IoT)
15. Open Data notes
successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
most of the largest Cascading deployments leverage some
Open Data components: Climate Corp, Factual, Nokia, etc.
consider buildingeye.com, which aggregates building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history,
for better insights about properties for sale…
research notes about
Open Data use cases:
goo.gl/cd995T
16. Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson)
as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s
as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dahl)
relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s
yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler)
New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt)
Digital Age → transparency, citizen-based “debugging”, bankruptcies
Adapted from:
The Roles, Actors, and Norms Necessary to
Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments have run out of money,
cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
17. Open Data ecosystem
(pipeline: municipal departments → publishing platforms → aggregators → data product vendors → end use cases; e.g., Palo Alto/Chicago/DC → Junar/Socrata → OpenStreetMap/WalkScore → Factual/Marinexplore → Facebook/Climate)
Data feeds structured for public private partnerships
18. Open Data ecosystem – caveats for agencies
(same ecosystem pipeline as above)
Required Focus
• respond to viable use cases
• not budgeting hackathons
19. Open Data ecosystem – caveats for publishers
(same ecosystem pipeline as above)
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
20. Open Data ecosystem – caveats for aggregators
(same ecosystem pipeline as above)
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
21. Open Data ecosystem – caveats for data vendors
(same ecosystem pipeline as above)
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream, i.e., cleaned data at source
• focus on core verticals
22. Open Data ecosystem – caveats for end uses
(same ecosystem pipeline as above)
Required Focus
• address consumer needs
• identify community benefits of the data
23. Recipes for Success
algorithmic modeling + machine data (Big Data) + curation, metadata + Open Data
→ data products, as feedback into automation
→ evolution of feedback loops
→ less about “bigness”, more about complexity
internet of things + A/D conversion + more complex analytics
→ accelerated evolution, additional feedback loops
→ orders of magnitude higher data rates
“A kind of Cambrian explosion” (images: National Geographic)
24. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the effects of Google Glass…
technologyreview.com/...
26. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
27. Learning Theory
in general, apps alternate between learning patterns/rules and retrieving similar things…
machine learning – scalable, arguably quite ad-hoc, generally “black box” solutions, enabling you to make billion dollar mistakes, with oh so much commercial emphasis (i.e., the “heavy lifting”)
statistics – rigorous, much slower to evolve, confidence and rationale become transparent, preventing you from making billion dollar mistakes; any good commercial project has ample stats work used in QA (i.e., “CYA, cover your analysis”)
once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
28. Generalizations about Machine Learning…
great introduction to ML, plus a proposed categorization
for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
toward a categorization for Machine Learning algorithms:
• representation: classifier must be represented in some
formal language that computers can handle (algorithms, data
structures, etc.)
• evaluation: evaluation function (objective function, scoring
function) is needed to distinguish good classifiers from bad
ones
• optimization: method to search among the classifiers in
the language for the highest-scoring one
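To make the three components concrete, here is a toy sketch (not from the paper) where the representation is a decision stump, the evaluation is accuracy, and the optimization is exhaustive search; the data set is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 1] > 0.2).astype(int)        # synthetic labels

    def stump_predict(X, feature, threshold):
        # representation: threshold on a single feature
        return (X[:, feature] > threshold).astype(int)

    def accuracy(y_true, y_pred):
        # evaluation: fraction of correct predictions
        return float(np.mean(y_true == y_pred))

    # optimization: exhaustive search over candidate stumps
    best = max(
        ((f, t) for f in range(X.shape[1]) for t in np.linspace(-2, 2, 41)),
        key=lambda ft: accuracy(y, stump_predict(X, *ft)),
    )
    print("best stump:", best,
          "accuracy:", accuracy(y, stump_predict(X, *best)))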
29. Something to consider about Algorithms…
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead of e-commerce in
terms of data rates and sophisticated algorithms; that work – as Breiman
suggested in 2001 – may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
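As one concrete example of the sketches mentioned above, here is a minimal Count-Min sketch (an illustrative toy, not tuned for production): it answers frequency queries in sub-linear memory, overestimating with bounded error:

    import hashlib

    class CountMin:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            for row in range(self.depth):
                h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
                yield row, int(h, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def query(self, item):
            # true count <= estimate, with high probability
            return min(self.table[row][col] for row, col in self._buckets(item))

    cm = CountMin()
    for word in ["spark", "spark", "hadoop", "spark"]:
        cm.add(word)
    print(cm.query("spark"))   # >= 3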
30. Make It Sparse…
also, take a moment to check this out…
(and related work on sparse Cholesky, etc.)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
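A single-machine sketch of why the tall-and-skinny case matters (illustrative only; the cited work distributes this across Hadoop clusters): when m >> n, the n-by-n factor R is tiny even if the interaction matrix is huge, and least squares or PCA can then operate on R:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 100_000, 8                 # tall and skinny: many rows, few columns
    A = rng.normal(size=(m, n))
    b = rng.normal(size=m)

    Q, R = np.linalg.qr(A)            # reduced QR: Q is m x n, R is n x n
    x = np.linalg.solve(R, Q.T @ b)   # least-squares solution via QR

    print(R.shape)                    # (8, 8) – small enough for one machine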
31. Sparse Matrix Collection
for those times when you really, really need
a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
32. A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… what impact does that
have on sampling rates?
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning apps at scale
33. Just Enough Mathematics?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra e.g., calculating algorithms for large-scale apps efficiently
graph theory e.g., representation of problems in a calculable language
abstract algebra e.g., probabilistic data structures in streaming analytics
topology e.g., determining the underlying structure of the data
operations research e.g., techniques for optimization … in other words, ROI
34. ADMM: a general approach for optimizing learners
Distributed Optimization and Statistical Learning
via the Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, et al., Stanford
stanford.edu/~boyd/papers/admm_distr_stats.html
“Throughout, the focus is on applications rather than theory, and a main goal is
to provide the reader with a kind of ‘toolbox’ that can be applied in many situations
to derive and implement a distributed algorithm of practical use. Though the focus
here is on parallelism, the algorithm can also be used serially, and it is interesting
to note that with no tuning, ADMM can be competitive with the best known
methods for some problems.”
“While we have emphasized applications that can be concisely explained, the
algorithm would also be a natural fit for more complicated problems in areas
like graphical models. In addition, though our focus is on statistical learning
problems, the algorithm is readily applicable in many other cases, such as in
engineering design, multi-period portfolio optimization, time series analysis,
network flow, or scheduling.”
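A compact numpy sketch of ADMM applied to the lasso (minimize ||Ax - b||^2/2 + lam*||x||_1), following the update equations in the monograph cited above; the problem data here are synthetic:

    import numpy as np

    def soft_threshold(v, k):
        return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

    def admm_lasso(A, b, lam, rho=1.0, iters=200):
        n = A.shape[1]
        x = z = u = np.zeros(n)
        # factor (A^T A + rho I) once; it is reused every iteration
        L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
        Atb = A.T @ b
        for _ in range(iters):
            x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
            z = soft_threshold(x + u, lam / rho)   # prox of the l1 term
            u = u + x - z                          # dual update
        return z

    rng = np.random.default_rng(2)
    A = rng.normal(size=(50, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
    b = A @ x_true + 0.01 * rng.normal(size=50)
    print(np.round(admm_lasso(A, b, lam=1.0), 2))   # recovers the sparse signal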
35. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
36. Enterprise Data Workflows
middleware for Big Data applications is evolving, with commercial examples that include:
• Cascading, Lingual, Pattern, etc. – Concurrent
• ParAccel Big Data Analytics Platform – Actian
• Anaconda supporting IPython Notebook, Pandas, Augustus, etc. – Continuum Analytics
(workflow diagram: data sources → ETL → data prep → predictive model → end uses)
37. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: ANSI SQL for ETL)
38. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: J2EE for business logic)
39. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: SAS for predictive models)
40. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: ANSI SQL for ETL and SAS for predictive models represent most of the licensing costs…)
41. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
(workflow diagram, annotated: J2EE for business logic represents most of the project costs…)
42. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
(workflow diagram: data sources → ETL → data prep → predictive model → end uses; Lingual: DW → ANSI SQL; Pattern: SAS, R, etc. → PMML; business logic in Java, Clojure, Scala, etc.; source taps for Cassandra, JDBC, Splunk, etc.; sink taps for Memcached, HBase, MongoDB, etc.)
a compiler sees it all… one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
cascading.org
43. Anatomy of an Enterprise app
a compiler sees it all… (same workflow diagram as above)
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

cascading.org
44. Anatomy of an Enterprise app
a compiler sees it all… (same workflow diagram as above)
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
45. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about the relational model
46. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading –
used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
domain-specific languages (DSLs) in JVM languages which
emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17, Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
47. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
(flow diagram elements: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left/RHS, GroupBy token, Count, Word Count – with M/R phase boundaries)
data is represented as flows of tuples
operations in the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
48. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
49. The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

this simple program provides an excellent test case for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
a distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
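For readers who want to run the idea, here is a single-process Python analogue of the pseudocode above (an illustration, not Hadoop code), with a dictionary standing in for the shuffle phase that groups emitted pairs by key:

    from collections import defaultdict

    def map_phase(doc_id, text):
        for w in text.split():            # stand-in for segment(text)
            yield (w, 1)

    def reduce_phase(word, group):
        yield (word, sum(group))

    docs = {"doc1": "a quick brown fox", "doc2": "a lazy dog and a fox"}

    groups = defaultdict(list)            # the "shuffle": group values by key
    for doc_id, text in docs.items():
        for word, count in map_phase(doc_id, text):
            groups[word].append(count)

    for word in sorted(groups):
        print(next(reduce_phase(word, groups[word])))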
53. WordCount – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

(flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count)
54. github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
55. WordCount – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\](),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

(same flow diagram as above)
56. github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
57. WordCount – Apache Hive

CREATE TABLE text_docs (line STRING);
LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE text_docs
;
SELECT
  word, COUNT(*)
FROM
  (SELECT
     split(line, '\t')[1] AS text
   FROM text_docs
  ) t
LATERAL VIEW explode(split(text, '[ ,.()]')) lTable AS word
GROUP BY word
;

(same flow diagram as above)
58. WordCount – Apache Hive
hive.apache.org
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
59. WordCount – Apache Pig

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';
-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;
-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

(same flow diagram as above)
60. WordCount – Apache Pig
pig.apache.org
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
61. Two Avenues to the App Layer…
(diagram: complexity vs. scale)
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
62. Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
63. Q3 1997: inflection point
four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
this effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this period
64. Circa 1996: pre- inflection point
(diagram elements: Customers ↔ Web App ↔ transactions ↔ RDBMS; BI Analysts run SQL Queries, turning result sets into Excel pivot tables and PowerPoint slide decks for Stakeholders; Product strategy, Engineering requirements, optimized code)
65. Circa 1996: pre- inflection point
(same diagram as above, annotated: “throw it over the wall”)
66. Circa 2001: post- big ecommerce successes
(diagram elements: customer transactions flow from Web Apps through Middleware (servlets, models) into Logs and an RDBMS; DW/ETL feeds SQL Queries and result sets; Algorithmic Modeling over event history powers recommenders + classifiers; aggregation feeds dashboards for Product, Engineering, UX, and Stakeholders)
67. Circa 2001: post- big ecommerce successes
(same diagram as above, annotated: “data products”)
68. Circa 2013: clusters everywhere
(diagram elements: Web Apps, Mobile, etc. feed transactions, content, and social interactions; Log Events, History, RDBMS, DW, In-Memory Data Grid, and Hadoop, etc. sit under a Cluster Scheduler and Planner; workflows span batch and near-time services to deliver Data Products to Customers; use cases cross topologies, involving s/w dev, data science (discovery + modeling), Ops dashboards/metrics, business process, optimized capacity, and taps; roles: Data Scientist, App Dev, Ops, Domain Expert; introduced capability alongside the existing SDLC)
69. Circa 2013: clusters everywhere
(same diagram as above, annotated: “optimize topologies”)
70. Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Primary Sources
71. Cluster Computing’s Dirty Little Secret
people like me make a good living by leveraging high ROI
apps based on clusters, and so the execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization… various notions
of “cloud” help
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”… All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
(image: Google Data Center, Fox News, ~2002)
72. Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of disruption in topologies
could this imply? because there’s
no such thing as RAM anymore…
73. Topologies
Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware, because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
[diagram: CAP triangle; labels: strong consistency (C), high availability (A), partition tolerance (P), eventual consistency]
“You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
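(Not from the deck: a toy sketch to make the CAP trade-off concrete. An “AP” system keeps accepting writes on both sides of a partition, so replicas diverge and must reconcile afterward. The Replica class and last-write-wins rule here are illustrative assumptions, not any particular datastore’s protocol.)

```python
# Toy illustration of the CAP trade-off (not any real datastore):
# an "AP" system keeps accepting writes during a partition, so the
# replicas diverge and must reconcile later, here by last-write-wins.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                     # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def merge(self, other):
        # reconcile after the partition heals: keep the newest write
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], ts=1)             # replicated normally
b.write("cart", ["book"], ts=1)

# partition: both sides remain available, and their writes diverge
a.write("cart", ["book", "lamp"], ts=2)
b.write("cart", ["book", "mug"], ts=3)

# heal: replicas converge, but one concurrent write is discarded
a.merge(b)
b.merge(a)
assert a.store == b.store
print(a.store["cart"])                      # (3, ['book', 'mug'])
```

Choosing strong consistency instead would mean refusing one side’s writes during the partition, trading availability for agreement; that is Brewer’s point about which property you discard.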
74. Some Topologies Other Than Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
ParAccel (MPP)
SciDB (array database)
75. “Return of the Borg”
consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers…
Borg serves as a kind of “secret sauce” for a data center OS, with Omega as its next evolution:
2011 GAFS Omega – John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
Omega: flexible, scalable schedulers for large compute clusters – Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
77. “Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon – Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines – Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
78. Mesos – definitions
a common substrate for cluster computing: heterogeneous assets in your data center or cloud, made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation between tasks with Linux Containers (pluggable)
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python (see the sketch after this list)
• web UI for inspecting cluster state
• available for Linux, Mac OS X, OpenSolaris
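(Not from the deck: a minimal sketch of what the Python API looks like, assuming the classic mesos / mesos_pb2 bindings of this era; the framework name, task command, and ZooKeeper URL are placeholders.)

```python
# Minimal Mesos framework sketch in Python (classic mesos /
# mesos_pb2 bindings, circa 2013); illustrative, not production code.
import mesos
import mesos_pb2

class HelloScheduler(mesos.Scheduler):
    def resourceOffers(self, driver, offers):
        # Mesos offers resources; the framework decides what to launch.
        for offer in offers:
            task = mesos_pb2.TaskInfo()
            task.task_id.value = "hello-" + offer.id.value
            task.slave_id.value = offer.slave_id.value
            task.name = "hello"
            task.command.value = "echo hello from mesos"

            cpus = task.resources.add()
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32

            driver.launchTasks(offer.id, [task])

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                   # let Mesos fill in the current user
framework.name = "hello-framework"    # placeholder name

# the master can be host:port or a zk:// URL; this one is a placeholder
driver = mesos.MesosSchedulerDriver(
    HelloScheduler(), framework, "zk://localhost:2181/mesos")
driver.run()
```

The shape of the API is the point: the framework receives resource offers and decides what to launch where, while Mesos handles allocation and isolation.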
79. Mesos – simplifies app development
[stack diagram: frameworks such as Chronos, Spark, Hadoop, DPark, and MPI run on Mesos, via JVM (Java, Scala, Clojure, JRuby), Python, and C++ bindings]
80. Mesos – data center OS stack
[stack diagram: kernel / OS / apps; Hadoop, Storm, Chronos, Rails, and JBoss run on Mesos; supporting layers: telemetry, capacity planning, GUI, security, smarter scheduling]
81. Prior Practice: Dedicated Servers
[diagram: a datacenter of statically dedicated servers]
• low utilization rates
• longer time to ramp up new services
82. Prior Practice: Virtualization
[diagram: datacenter carved into provisioned VMs]
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
83. Prior Practice: Static Partitioning
[diagram: datacenter carved into static partitions]
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
• static partitioning limits elasticity
84. Mesos: One Large Pool of Resources
[diagram: the datacenter as a single shared pool managed by Mesos]
“We wanted people to be able to program for the data center just like they program for their laptop.” – Ben Hindman
85. What are the costs of Virtualization?
benchmark type    OpenVZ improvement
mixed workloads   210%-300%
LAMP (related)    38%-200%
I/O throughput    200%-500%
response time     order of magnitude; more pronounced at higher loads
86. What are the costs of Single Tenancy?
[charts: CPU load (0-100%) over time for single-tenant Rails, Memcached, and Hadoop clusters, versus the combined CPU load when Rails, Memcached, and Hadoop share the same nodes]
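(Not from the deck: a back-of-the-envelope sketch of why single tenancy is costly. The load numbers below are invented to illustrate the effect; the point is that workloads peaking at different times can share capacity.)

```python
# Hypothetical hourly CPU-load profiles, as fractions of one
# cluster's capacity; the numbers are invented for illustration,
# not measurements from the slide.
rails     = [0.7, 0.6, 0.2, 0.1]   # peaks during the day
memcached = [0.3, 0.4, 0.3, 0.2]   # fairly flat
hadoop    = [0.1, 0.1, 0.8, 0.9]   # batch jobs run overnight

# single tenancy: each workload gets a cluster sized for its own peak
single_tenant = max(rails) + max(memcached) + max(hadoop)

# shared pool: capacity only needs to cover the combined peak
shared = max(r + m + h for r, m, h in zip(rails, memcached, hadoop))

print(round(single_tenant, 2))   # 2.0 clusters' worth of capacity
print(round(shared, 2))          # 1.3, because the peaks don't coincide
```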
87. Compelling arguments for a Data Center OS
• obviates the need for VMs (licensing; adios VMware)
• provides OS-level building blocks for developing new distributed frameworks (learning curve; adios Hadoop)
• removes significant VM overhead (performance)
• requires less h/w to buy (CapEx) and to power and fix (OpEx)
• implies fewer VMs, thus less Ops overhead (staff)
• removes the complexity of Chef/Puppet (staff)
• allows higher utilization rates (ROI)
• reduces latency for data updates (OLTP + OLAP on the same server)
• reshapes cluster resources dynamically (100s of ms vs. minutes)
• runs dev/test clusters on the same h/w as production (flexibility)
• evaluates multiple versions without more h/w (vendor lock-in)
88. Opposite Ends of the Spectrum, One Substrate
[diagram: isolation options on one substrate: built-in / bare metal, hypervisors, Solaris Zones, Linux CGroups]
89. Opposite Ends of the Spectrum, One Substrate
[diagram: workloads on one substrate, from request/response services to batch]
90. Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”
– Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think about resources like CPU, memory, and disk
• allows services to scale and leverage a shared pool of servers across data centers efficiently
• reduces the time between prototyping and launching
91. Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.”
– Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance an engineering strategy of building small teams that can move fast
• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop
• allowed the company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.