This document discusses using open source tools and data science to drive business value. It provides an overview of Pivotal's data science toolkit, which includes tools like PostgreSQL, Hadoop, MADlib, R, and Python, and explains how MADlib can be used for machine learning and analytics directly in the database, and how R and Python can interface with MADlib via tools like PivotalR and pyMADlib. This allows performing advanced analytics without moving large amounts of data.
Large-Scale Machine Learning with Apache Spark (DB Tsai)
Spark is a new cluster computing engine that is rapidly gaining popularity — with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed to both make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark’s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We’ll also cover upcoming development work including new built-in algorithms and R bindings.
Bio:
Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford is on randomized algorithms for large-scale linear regression.
Practical Machine Learning Pipelines with MLlib (Databricks)
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ... (Jose Quesada)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Pivotal OSS meetup - MADlib and PivotalR (go-pivotal)
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib along with the ease of use of R.
This presentation covers the basics of Apache Spark, with details about its Machine Learning module. It ends with a demo covering a machine learning pipeline with Spark, and also shows how to install a standalone cluster on the local machine and how to deploy an application to the Spark cluster.
High-Performance Advanced Analytics with Spark-Alchemy (Databricks)
Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable.
Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches.
We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity.
We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
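As a hedged illustration of the reaggregability point above (a minimal sketch; the raw_events table, its columns, and the daily_rollup table are hypothetical, not from the talk):

-- Daily rollup: plain counts are reaggregable, distinct counts are not
CREATE TABLE daily_rollup AS
SELECT event_date,
       country,
       count(*)            AS events,         -- reaggregable with SUM
       count(DISTINCT uid) AS distinct_users  -- NOT reaggregable across days
FROM raw_events
GROUP BY event_date, country;

-- Correct: monthly event totals can be rebuilt from the rollup
SELECT date_trunc('month', event_date) AS month, sum(events) AS events
FROM daily_rollup
GROUP BY 1;

-- Wrong: sum(distinct_users) double-counts users active on several days;
-- making distinct counts reaggregable is exactly what HLL sketches (e.g. via spark-alchemy) are for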
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Machine learning techniques are powerful, but building and deploying such models for production use requires a lot of care and expertise.
A lot of books, articles, and best practices have been written about machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and go into detail on how to deploy sustainable machine learning pipelines.
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel... (Benjamin Bengfort)
This is an overview of the goals and roadmap for the Yellowbrick model visualization library (www.scikit-yb.org). If you're interested in contributing to Yellowbrick or writing visualizers, this is a good place to get started.
In the presentation we discuss the expected workflow of data scientists interacting with the model selection triple and Scikit-Learn. We describe the Yellowbrick API and its relationship to the Scikit-Learn API. We introduce our primary object: the Visualizer, an estimator that learns from data and displays it visually. Finally, we describe the requirements for developing for Yellowbrick, the tools and utilities in place, and how to get started.
Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, but to produce visualizations for your models!
This presentation was given during the opening session of the 2017 Spring DDL Research Labs.
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ... (Spark Summit)
Something really exciting and largely unnoticed is going on in the Spark ecosystem. As data scientists and engineers learn Spark, they’re actually all implicitly learning a much older, more general topic: typed functional programming. While Spark itself was built on an accumulation of powerful computer science concepts from functional programming and other areas, developers are often encountering these ideas in the context of Spark for the first time. It turns out that Spark makes an excellent platform for learning concepts like immutability, higher order and anonymous functions, laziness, and monadic operators.
This talk will discuss how Spark can be used as teaching tool, to build skills in areas like typed functional programming. We’ll explore a skill-building curriculum that can be used with a data scientist or engineer who only has experience in imperative, dynamically-typed languages like Python. This curriculum introduces the core concepts of functional programming and type theory, while providing learners the opportunity to immediately apply their skills at massive scale, using the power of Spark’s painless scalability and resilience.
Based on the experience of building machine learning teams at x.ai and other data-centric startups, this curriculum is the foundation of building poly-skilled, highly autonomous team members who can build scalable intelligent systems. We’ll work from foundational concepts of Scala and functional programming towards a fully implemented machine learning pipeline, all using Spark and MLlib. Unique new features of Spark like Datasets and Structured Streaming will be particularly useful in this effort. Using this approach, teams can help members in all roles learn how to use sophisticated programming techniques that ensure correctness at scale. With these skills in their toolbox, data scientists and engineers often find that building powerful machine learning systems is intuitive, easy, and even fun.
Apache Spark's MLlib's Past Trajectory and new Directions (Databricks)
This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib.
Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements.
This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.
Continuous Evaluation of Deployed Models in Production (Databricks)
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn't it be great if the model's outputs were automatically monitored? If they could be visualized, sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience building one such core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly used metrics like area under the curve (AUC) and root mean square error (RMSE). In addition, summary statistics about the model's outputs and their distributions are computed. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. geography, product type) to provide insights into model performance over different segments. The talk will describe the components required to build such a service and the metrics of interest. Our system has a backend component built with Spark on Azure Databricks; the backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib to compute AUC sliced by different dimensions, and about other optimizations in Spark to improve compute and performance. Our front end and middle tier, built with Docker and Azure Web App, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
A Hands-on Intro to Data Science and R Presentation.ppt (Sanket Shikhar)
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
27 Aug 2013 Webinar High Performance Predictive Analytics in Hadoop and R presented by Mario E. Inchiosa, PhD., US Data Scientist and Kathleen Rohrecker, Director of Product Marketing
Analyzing Big data in R and Scala using Apache Spark 17-7-19 (Ahmed Elsayed)
Data mining can be used to make predictions about future data from existing data, especially big data, using machine learning algorithms running on two clusters. One, Hadoop, manages the big data file system; the other, Apache Spark, performs fast analysis of that data. To achieve this we will use R (via RStudio) or Scala (via Zeppelin).
This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data sciences
- Hiring managers will expand their knowledge of the skills required to bring business value with data
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the... (Amazon Web Services)
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology, and Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. The combination of the two can power advanced analytics not only for what has happened in the past, but also to make intelligent predictions about the future. Please join this webinar to learn how to get the most value from your data for your data-driven business.
Learning Objectives:
How to scale your Redshift queries with user-defined functions (UDFs)
How to apply Machine learning to historical data in Amazon Redshift
How to visualize your data with Amazon QuickSight
Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs, or predictive analytics to their applications, database administrators that need to meet the demand of data driven organizations, decision makers looking to derive more insight from their data
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5 (Robert Grossman)
This is a talk I gave in San Diego on July 29, 2009 explaining some of the impact and some of the opportunities of cloud computing on predictive analytics.
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation (Revolution Analytics)
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF (MLconf)
Abstract: How graphs became just another big data primitive
Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. And a growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today's graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we'll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We'll examine some practical data science workflows to further motivate this argument, and we'll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Data science and OSS
1. Hands-on Data Science and OSS: Driving Business Value with Open Source and Data Science
Kevin Crocker, @_K_C_Pivotal
#datascience, #oscon
2.
3. VM info
Everything is 'oscon2014'
User:password -> oscon2014:oscon2014
PostgreSQL 9.2.8, dbname -> oscon2014
Root password -> oscon2014
Installed software: PostgreSQL 9.2.8, R, MADlib, PL/Python, PL/pgSQL, Anaconda (/home/oscon2014/anaconda), pgAdmin III, RStudio, pyMADlib, and more to come in v1.1
4. Objective of Data Science
DRIVE AUTOMATED LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
5. What Matters: Apps. Data. Analytics.
Apps power businesses, and those apps generate data
Analytic insights from that data drive new app functionality, which in turn drives new data
The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition
6. What Matters: OSS at the core
Apps power businesses, and those apps generate data
Analytic insights from that data drive new app functionality, which in turn drives new data
The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition
7. End Game: Drive Business Value with OSS
Find interesting problems that can't easily be solved with current technology
Use (find) the right tool for the job
- If they don't exist, create them
- Prefer OSS if it fits the need
- Drive business value through distributed, MPP analytics
- Operationalization (O16n) of your analytics
Create interesting solutions that drive business value
8. PIVOTAL DATA SCIENCE TOOLKIT: a large and varied tool box!
1. Find Data - Platforms: Pivotal Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS
2. Write Code
3. Run Code - Interfaces: pgAdmin III, psql, psycopg2, Terminal, Cygwin, PuTTY, WinSCP
4. Write Code for Big Data
5. Implement Algorithms
6. Show Results
7. Collaborate - Sharing Tools: Chorus, Confluence, Socialcast, GitHub, Google Drive & Hangouts
9. Toolkit?
This image was created by Swami Chandrasekaran, Enterprise Architect, IBM. He has a great article about what it takes to be a Data Scientist: Road Map to Data Scientist, http://nirvacana.com/thoughts/becoming-a-data-scientist/
11. Open Source At Pivotal
Pivotal has a lot of open source projects (and people) involved in Open Source:
PostgreSQL, Apache Hadoop (4)
MADlib (16), PivotalR (2), pyMADlib (4), Pandas via SQL (3)
Spring (56), Groovy (3), Grails (3)
Apache Tomcat (2) and HTTP Server (1)
Redis (1)
RabbitMQ (4)
Cloud Foundry (90)
Open Chorus
We use a combination of our commercial software and OSS to drive business value through Data Science
12. Motivation
Our story starts with SQL – so naturally we try to use SQL for everything! Everything?
SQL is great for many things, but it's not nearly enough
–Straightforward way to query data
–Not necessarily designed for data science
Data Scientists know other languages – R, Python, ...
13. Our challenge
MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL
R / Python
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science
Want to leverage both the performance benefits of MADlib and the usability of languages like R and Python
14. How Pivotal Data Scientists Select Which Tool to Use
Optimized for algorithm performance, scalability, & code overhead
16. MADlib
MAD stands for: Magnetic, Agile, Deep
lib stands for a library of:
– advanced (mathematical, statistical, machine learning)
– parallel & scalable
– in-database functions
Mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development
17. MADlib: A Community Project
Open Source: BSD License
• Developed as a partnership with multiple universities
– University of California, Berkeley
– University of Wisconsin-Madison
– University of Florida
• Compatible with Postgres, Greenplum Database, and Hadoop via HAWQ
• Designed for Data Scientists to provide scalable, robust analytics capabilities for their business problems
• Homepage: http://madlib.net
• Documentation: http://doc.madlib.net
• Source: https://github.com/madlib
• Forum: http://groups.google.com/group/madlib-user-forum
18. MADlib: Architecture
Database Platform Layer: User Defined Functions, User Defined Types, User Defined Aggregates, User Defined Operators, OLAP Window Functions, OLAP Grouping Sets
C++ Database Abstraction Layer: Data Type Mapping, Exception Handling, Logging and Reporting, Memory Management, Boost Support, Linear Algebra
Support Modules: Random Sampling, Array Operations, Sparse Vectors, Probability Functions
Core Methods: Generalized Linear Models, Matrix Factorization, Linear Systems, Machine Learning Algorithms
19. MADlib: Diverse User Experience
SQL, Python, R, Open Chorus
Python (pyMADlib):
from pymadlib.pymadlib import *
conn = DBConnect()
lreg = LinearRegression(conn)
lreg.train(input_table, indepvars, depvar)
cursor = lreg.predict(input_table, depvar)
scatterPlot(actual, predicted, dataset)
SQL (psql):
psql> select madlib.linregr_train('abalone',
                                  'abalone_linregr',
                                  'rings',
                                  'array[1,diameter,height]');
psql> select coef, r2 from abalone_linregr;
-[ RECORD 1 ]----------------------------------------------
coef | {2.39392531944631,11.7085575219689,19.8117069108094}
r2   | 0.350379630701758
20. MADlib In-Database Functions
Predictive Modeling Library
Linear Systems: Sparse and Dense Solvers
Matrix Factorization: Singular Value Decomposition (SVD), Low-Rank
Generalized Linear Models: Linear Regression, Logistic Regression, Multinomial Logistic Regression, Cox Proportional Hazards Regression, Elastic Net Regularization, Sandwich Estimators (Huber white, clustered, marginal effects)
Machine Learning Algorithms: Principal Component Analysis (PCA), Association Rules (Affinity Analysis, Market Basket), Topic Modeling (Parallel LDA), Decision Trees, Ensemble Learners (Random Forests), Support Vector Machines, Conditional Random Field (CRF), Clustering (K-means), Cross Validation
Descriptive Statistics
Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
Correlation, Summary
Support Modules: Array Operations, Sparse Vectors, Random Sampling, Probability Functions
21. Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train( 'houses',                     -- table containing training data
                             'houses_linregr',             -- table in which to save results
                             'price',                      -- column containing the dependent variable
                             'ARRAY[1, tax, bath, size]',  -- features included in the model
                             'bedroom');                   -- grouping column: creates multiple output models (one for each value of bedroom)
MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
All the data can be used in one model
Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)
Open source lets you tweak and extend methods, or build your own
22. Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');
SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],   -- MADlib model scoring function
                              m.coef) AS predict
FROM houses,            -- table with data to be scored
     houses_linregr m;  -- table containing the model
MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
All the data can be used in one model
Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)
Open source lets you tweak and extend methods, or build your own
23. K-Means Clustering
Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, given n points x_1, ..., x_n ∈ ℝ^d, the goal is to position k centroids c_1, ..., c_k ∈ ℝ^d so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest.
So, we are trying to find the centroids which minimize the total distance between all the points and the centroids.
24. K-means Clustering
Example Use Cases:
– Which blogs are spam blogs?
– Given a user's preferences, which other blog might she/he enjoy?
– What are our customers saying about us?
25. What are our customers saying about us?
How do we discern trends and categories in on-line conversations?
- Search for relevant blogs
- 'Fingerprinting' based on word frequencies
- Similarity measure
- Identify 'clusters' of documents
26. What are our customers saying about us?
Method
• Construct document histograms
• Transform histograms into document "fingerprints"
• Use clustering techniques to discover similar documents
27. What are our customers saying about us?
Constructing document histograms
- Parse and extract HTML files
- Use natural language processing for tokenization and stemming
- Cleanse inconsistencies
- Transform unstructured data into structured data
28. What are our customers saying about us?
"Fingerprinting"
- Term frequency of words within a document vs. the frequency that those words occur in all documents
- Term frequency-inverse document frequency (tf-idf) weight
- Easily calculated based on formulas over the document histograms
- The result is a vector in n-dimensional Euclidean space
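As a rough sketch of that calculation in plain SQL (the doc_histogram table and its columns are hypothetical stand-ins for the histograms built in the previous step):

-- tf-idf from a per-document term histogram: doc_histogram(doc_id, term, term_count)
CREATE TABLE doc_tfidf AS
SELECT h.doc_id,
       h.term,
       (h.term_count::float8 / d.doc_len)                      -- term frequency within the document
         * ln(c.n_docs::float8 / df.n_docs_with_term) AS tfidf -- inverse document frequency
FROM doc_histogram h
JOIN (SELECT doc_id, sum(term_count) AS doc_len
      FROM doc_histogram GROUP BY doc_id) d USING (doc_id)
JOIN (SELECT term, count(DISTINCT doc_id) AS n_docs_with_term
      FROM doc_histogram GROUP BY term) df USING (term)
CROSS JOIN (SELECT count(DISTINCT doc_id) AS n_docs FROM doc_histogram) c;

The per-document tf-idf values can then be assembled into an array (or MADlib sparse vector) column to form the fingerprint vectors that k-means will cluster.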
29. K-Means Clustering – Training Function
The k-means algorithm can be invoked in four ways, depending on the source of the initial set of centroids:
1. Use the random centroid seeding method.
2. Use the kmeans++ centroid seeding method.
3. Supply an initial centroid set in a relation identified by the rel_initial_centroids argument.
4. Provide an initial centroid set as an array expression in the initial_centroids argument.
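The deck only spells out the signatures for options 3 and 4 (next slides); as a hedged sketch, option 2 with MADlib's kmeans++ seeding function would look roughly like this, using the hypothetical doc_vectors table from the tf-idf step above:

SELECT * FROM madlib.kmeanspp( 'doc_vectors',                -- rel_source
                               'tfidf_vec',                  -- expr_point (array/vector column)
                               10,                           -- k: number of centroids
                               'madlib.squared_dist_norm2',  -- fn_dist
                               'madlib.avg',                 -- agg_centroid
                               20,                           -- max_num_iterations
                               0.001                         -- min_frac_reassigned
                             );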
32. Initial Centroid set in a relation
kmeans( rel_source,
expr_point,
rel_initial_centroids, -- this is the relation
expr_centroid,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
33. Initial centroid as an array
kmeans( rel_source,
expr_point,
initial_centroids, -- this is the array
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
34. K-Means Clustering – Cluster Assignment
After training, the cluster assignment for each data point can be computed with the help of the following function:
closest_column( m, x )
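A hedged usage sketch (same hypothetical table names as above; the k-means output is assumed to have been stored in a kmeans_result table with a centroids column):

SELECT v.doc_id,
       (madlib.closest_column(k.centroids, v.tfidf_vec)).column_id AS cluster_id
FROM doc_vectors v, kmeans_result k;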
35. Assessing the quality of the clustering
A popular method to assess the quality of the clustering is the silhouette
coefficient, a simplified version of which is provided as part of the k-means
module. Note that for large data sets, this computation is expensive.
The silhouette function has the following syntax:
simple_silhouette( rel_source,
expr_point,
centroids,
fn_dist
)
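A hedged call using the same hypothetical tables as in the sketches above:

SELECT * FROM madlib.simple_silhouette( 'doc_vectors',
                                        'tfidf_vec',
                                        (SELECT centroids FROM kmeans_result),
                                        'madlib.squared_dist_norm2'
                                      );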
36. What are our customers saying about us?
51. Pivotal & R Interoperability
In a traditional analytics workflow using R:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
PivotalR explores the situation involving large data sets where these two assumptions are violated and you have an R background
52. Enter PivotalR
Challenge: want to harness the familiarity of R's interface and the performance & scalability benefits of in-DB analytics
Simple solution: translate R code into SQL
53. PivotalR Design Overview
(Diagram) PivotalR turns R calls into SQL and ships them over RPostgreSQL: 1. R generates SQL; 2. the SQL executes in the database/Hadoop with MADlib; 3. only computation results come back. No data lives on the R side; the data lives in the database.
54. PivotalR Design Overview
Call MADlib's in-database machine learning functions directly from R
Syntax is analogous to native R functions – for example, madlib.lm() mimics the syntax of the native lm() function
Data does not need to leave the database
All heavy lifting, including model estimation & computation, is done in the database
55. PivotalR Design Overview
Manipulate database tables directly from R without needing to be familiar with SQL
Perform the equivalent of SQL's 'select' statements (including joins) in a syntax that is similar to R's data.frame operations
For example: R's 'merge' corresponds to SQL's 'join'
56. PivotalR: Current Functionality
MADlib functionality: Linear Regression, Logistic Regression, Elastic Net, ARIMA, Marginal Effects, Cross Validation, Bagging, summary on model objects, Automated Indicator Variable Coding (as.factor), predict
data.frame operations and operators: $ [ [[ $<- [<- [[<-, is.na, + - * / %% %/% ^, & | !, == != > < >= <=, merge, by, db.data.frame, as.db.data.frame, preview, sort, c mean sum sd var min max length colMeans colSums, dim, names, content
Database utilities: db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete
And more ... (SQL wrapper)
http://github.com/gopivotal/PivotalR/
57. PivotalR Example
Load the PivotalR package
– > library('PivotalR')
Get help for a function
– > help(db.connect)
Connect to a database
– > db.connect(host = "dca.abc.com", user = "student01", dbname = "studentdb", password = "studentpw", port = 5432, madlib = "madlib", conn.pkg = "RPostgreSQL", default.schemas = NULL)
List connections
– > db.list()
58. PivotalR Example
Connect to a table via the db.data.frame function (note that the data remains in the database and is not loaded into memory)
– > y <- db.data.frame("test.abalone", conn.id = 1, key = character(0), verbose = TRUE, is.temp = FALSE)
Fit a linear regression model (one model for each gender) and display it
– > fit <- madlib.lm(rings ~ . - id | sex, data = y)
– > fit # view the result
Apply the model to data in another table (i.e. x) and compute the mean square error
– > lookat(mean((x$rings - predict(fit, x))^2))
60. PivotalR
PivotalR is an R package you can download from CRAN
- http://cran.r-project.org/web/packages/PivotalR/index.html
- Using RStudio, you can install it with: install.packages("PivotalR")
GitHub has the latest, greatest code and features but is less stable
- https://github.com/gopivotal/PivotalR
R front end to PostgreSQL and all PostgreSQL databases
R wrapper around MADlib, the open source library for in-database scalable analytics
Mimics regular R syntax for manipulating R's "data.frame"
Provides R functionality for Big Data stored in-database or in Apache Hadoop
Demo code: https://github.com/gopivotal/PivotalR/wiki/Example
Training video: https://docs.google.com/file/d/0B9bfZ-YiuzxQc1RWTEJJZ2V1TWc/edit
62. SQL & R: PL/R on Pivotal
Procedural Language (PL/X)
– X includes R, Python, pgSQL, Java, Perl, C etc.
– Needs to be installed on each database
PL/R enables you to write PostgreSQL and DB functions in the R language
R installed on each segment of the Pivotal cluster
Parsimonious – R piggy-backs on Pivotal's parallel architecture
Minimize data movement
63. PL/R on Pivotal
Allows most of R's capabilities. Basic guide: "PostgreSQL Functions by Example" http://www.joeconway.com/presentations/function_basics.pdf
In PostgreSQL and GPDB/PHD: check which PL languages are installed in the database.
PL/R is an "untrusted" language – only database superusers have the ability to create UDFs with PL/R (see the "lanpltrusted" column in the pg_language table)
select * from pg_language;
 lanname   | lanispl | lanpltrusted | lanplcallfoid | lanvalidator | lanacl
-----------+---------+--------------+---------------+--------------+--------
 internal  | f       | f            | 0             | 2246         |
 c         | f       | f            | 0             | 2247         |
 sql       | f       | t            | 0             | 2248         |
 plpgsql   | t       | t            | 10885         | 10886        |
 plpythonu | t       | f            | 16386         | 0            |
 plr       | t       | f            | 18975         | 0            |
(6 rows)
64. PL/R Example
Consider the census dataset below (each row represents an individual):
– h_state = integer encoding which state they live in
– earns = their income
– hours = how many hours per week they work
– … and other features
Suppose we want to build a model of income for each state separately
66. PL/R Example
Prepare table for PL/R by converting it into array form
-- Create array version of table
DROP TABLE IF EXISTS use_r.census1_array_state;
CREATE TABLE use_r.census1_array_state AS(
SELECT
h_state::text h_state,
array_agg(h_serialno::float8) h_serialno, array_agg(earns::float8) earns,
array_agg(hours::float8) hours, array_agg((earns/hours)::float8) wage,
array_agg(hsdeg::float8) hsdeg, array_agg(somecol::float8) somecol,
array_agg(associate::float8) associate, array_agg(bachelor::float8) bachelor,
array_agg(masters::float8) masters, array_agg(professional::float8) professional,
array_agg(doctorate::float8) doctorate, array_agg(female::float8) female,
array_agg(rooms::float8) rooms, array_agg(bedrms::float8) bedrms,
array_agg(notcitizen::float8) notcitizen, array_agg(rentshouse::float8) rentshouse,
array_agg(married::float8) married
FROM use_r.census1
GROUP BY h_state
) DISTRIBUTED BY (h_state);
67. PL/R Example
(Diagram: SQL & R) Each state's data (TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA) is processed in parallel, producing one model per state (TN model, CA model, NY model, ...).
68. PL/R Example
Run linear regression to predict income in each state:
– Define the output data type
– Create the PL/R function (the body of the function is R, wrapped in a SQL CREATE FUNCTION statement and invoked through a SQL wrapper)
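The slide shows this only as annotated code callouts; a minimal hedged sketch of what such a per-state PL/R model function could look like (the output type, function name, and the handful of predictors are illustrative, picked from the columns of the census1_array_state table built on slide 66):

-- Output data type for the model coefficients
CREATE TYPE use_r.lm_coefs AS (
    variable text,
    coef     float8
);

-- PL/R function: R body wrapped in a SQL CREATE FUNCTION statement
CREATE OR REPLACE FUNCTION use_r.lm_earns(
    earns float8[], hours float8[], bachelor float8[], female float8[])
RETURNS SETOF use_r.lm_coefs
AS $$
    d <- data.frame(earns = earns, hours = hours,
                    bachelor = bachelor, female = female)
    m <- lm(earns ~ hours + bachelor + female, data = d)
    data.frame(variable = names(coef(m)), coef = as.numeric(coef(m)))
$$ LANGUAGE plr;

-- SQL wrapper: one model per state, evaluated in parallel across segments
SELECT h_state, (use_r.lm_earns(earns, hours, bachelor, female)).*
FROM use_r.census1_array_state;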
70. PL/R
PL/R is not installed – I have to download the source and compile it
Instructions can be found here
http://www.joeconway.com/plr/doc/plr-install.html
CHALLENGE: Download PL/R, compile it, and install it in PostgreSQL
72. Pivotal & Python Interoperability
In a traditional analytics workflow using Python:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
pyMADlib explores the situation involving large data sets where these two assumptions are violated and you have a Python background
73. Enter pyMADlib
Challenge: want to harness the familiarity of Python's interface and the performance & scalability benefits of in-DB analytics
Simple solution: translate Python code into SQL
74. pyMADlib Design Overview
(Diagram) Same pattern as PivotalR: 1. Python generates SQL; 2. the SQL executes over ODBC/JDBC in the database/Hadoop with MADlib; 3. only computation results come back. No data lives on the Python side; the data lives in the database.
75. Simple solution: Translate Python code into SQL
All data stays in the DB and all model estimation and heavy lifting is done in the DB by MADlib
Only strings of SQL and model output are transferred across ODBC/JDBC
Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.
79. PL/Python on Pivotal
Syntax is like normal Python function with function definition line replaced by
SQL wrapper
Alternatively like a SQL User Defined Function with Python inside
Name in SQL is plpythonu
– ‘u’ means untrusted so need to be superuser to create functions
CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer
AS $$
if a > b:
return a
return b
$$ LANGUAGE plpythonu;
80. Returning Results
Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Composite types can be returned by creating a composite type in the database:
CREATE TYPE named_value AS (
name text,
value integer
);
Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value
AS $$
return [ name, value ]
# or alternatively, as tuple: return ( name, value )
# or as dict: return { "name": name, "value": value }
# or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
For functions which return multiple rows, prefix “setof” before the return type
81. Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
Sequence:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
return ([ name, 1 ], [ name, 2 ], [ name, 3])
$$ LANGUAGE plpythonu;
Generator:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS $$
for i in range(3):
    yield (name, i)
$$ LANGUAGE plpythonu;
82. Accessing Packages
In an MPP environment: To be available, packages must be installed on
every individual segment node.
–Can use “parallel ssh” tool gpssh to conda/pip install
Then just import as usual inside function:
CREATE FUNCTION make_pair (name text)
RETURNS named_value
AS $$
import numpy as np
return ((name,i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
83. Benefits of PL/Python
Easy to bring your code to the data
When SQL falls short, leverage your Python (or R/Java/C) experience quickly
Apply Python across terabytes of data with minimal overhead or additional requirements
Results are already in the database system, ready for further analysis or storage
84. Spring
What it is: Application framework introduced as open source in 2003
Intention: Build enterprise-class Java applications more easily
Outcomes:
1. Streamlined architecture, speeding application development by 2x and accelerating Time to Value
2. Portable, since Spring applications are identical for every platform; portable across multiple app servers
85. Spring Ecosystem
WEB: Controllers, REST, WebSocket
INTEGRATION: Channels, Adapters, Filters, Transformers
BATCH: Jobs, Steps, Readers, Writers
BIG DATA: Ingestion, Export, Orchestration, Hadoop
DATA: Relational, Non-relational
CORE: Framework, Groovy, Security, Reactor
GRAILS: Full-stack, Web
XD: Stream, Taps, Jobs
BOOT: Bootable, Minimal, Ops-Ready
http://projects.spring.io/spring-xd/ | http://projects.spring.io/spring-data/ | http://spring.io
86. Spring XD - Tackling Big Data Complexity
One-stop shop for
–Data Ingestion
–Real-time Analytics
–Workflow Orchestration
–Data Export
Built on existing Spring assets
–Spring Integration, Batch, Data
XD = 'eXtreme Data'
(Diagram) Big Data programming model: ingest from files, mobile, social and sensor sources; compute with jobs and workflows over HDFS; run analytics; export to RDBMS, GemFire, Redis, OLAP and more.
87. Groovy and Grails
Dynamic Language for the JVM
Inspired by Smalltalk, Python, and Ruby
Integrated with the Java language & platform at every level
88. “Cloud”
Means many things to many people.
Distributed applications accessible over a network
Typically, but not necessarily, The Internet
An application and/or its platform
Resources on demand
Inherently virtualized
Can run in-house (private cloud) as well
Hardware and/or software sold as a commodity
89. Pivotal Speakers at OSCON 2014
10:40am Tuesday, Global Scaling at the New York Times using RabbitMQ, F150
Alvaro Videla (RabbitMQ), Michael Laing (New York Times)
Cloud
11:30am Tuesday, The Full Stack Java Developer, D136
Joshua Long (Pivotal), Phil Webb (Pivotal)
Java & JVM | JavaScript - HTML5 - Web
1:40pm Tuesday, A Recovering Java Developer Learns to Go, E142
Matt Stine (Pivotal)
Emerging Languages | Java & JVM
90. Pivotal Speakers at OSCON 2014
2:30pm Tuesday, Unicorns, Dragons, Open Source Business Models And Other Mythical
Creatures, PORTLAND BALLROOM, Main Stage
Andrew Clay Shafer (Pivotal)
11:30am Wednesday, Building a Recommendation Engine with Spring and Hadoop, D136
Michael Minella (Pivotal)
Java & JVM
1:40pm Wednesday, Apache Spark: A Killer or Savior of Apache Hadoop?, E143
Roman Shaposhnik (Pivotal)
Sponsored Sessions
91. Pivotal Speakers at OSCON 2014
10:00am Thursday, Developing Micro-services with Java and Spring, D139/140
Phil Webb (Pivotal)
Java & JVM | Tools & Techniques
11:00am Thursday, Apache HTTP Server; SSL from End-to-End, D136
William A Rowe Jr (Pivotal)
Security
92. Data Science At Pivotal
Drive business value by operationalizing Data Science models using a combination of our
commercial software (based on open source) and open source software.
Open Source is at the core of what we do
Thank You!
Kevin Crocker
kcrocker@gopivotal.com
@_K_C_Pivotal
Data Science Education Lead
Access MADlib's functions through SQL – but it is also more than just SQL: you can access MADlib functions through R (with the PivotalR package) and Python (with the pyMADlib library), as well as through the Data Science Studio in Chorus.