Slideset of the training we gave at the Spark Summit East.
Blog: https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
This document contains summaries of "folk knowledge" or common sayings related to data science. Some key points included are:
- Machine learning requires both data and some prior knowledge or assumptions in order to generalize beyond the training data.
- Overfitting can take many forms like high bias, high variance, or sampling bias.
- Intuition fails in high dimensions according to Bellman's curse of dimensionality.
- Feature engineering is key, and more data often beats a cleverer algorithm.
R, Data Wrangling & Kaggle Data Science Competitions (Krishna Sankar)
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial, plus a few updates.
It is a work in progress; I will update it with daily snapshots until done.
Big Data Analytics - Best of the Worst: Anti-patterns & Antidotes (Krishna Sankar)
This document discusses best practices for big data analytics. It emphasizes the importance of data curation to ensure semantic consistency and quality across diverse data sources. It warns against simply accumulating large amounts of ungoverned data ("data swamps") without relevant analytics or business applications. Instead, it advocates taking a full stack approach by building incremental decision models and data products to demonstrate value from the beginning. The document also stresses the need for data management layers, appropriate computing frameworks, and real-time and batch analytics capabilities to enable flexible exploration and insights.
The document outlines an agenda for a workshop on Pandas, data wrangling, and data science using Pandas. The agenda includes: an introduction and setup; discussing the data science pipeline and Pandas APIs/namespaces; basic Pandas maneuvers; data wrangling techniques like transformations, aggregations, and joins; hands-on exercises using datasets like Titanic and RecSys-2015; and a Q&A session. The goals are to understand data wrangling with Pandas through interactive examples and hands-on practice with real datasets.
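A minimal Pandas sketch of the three wrangling moves the agenda names (transformation, aggregation, join), on a toy Titanic-like frame; the column names are illustrative, not the workshop's actual datasets:

    import pandas as pd

    # Toy Titanic-like frame; column names are illustrative
    df = pd.DataFrame({
        "pclass": [1, 3, 3, 2],
        "sex": ["female", "male", "male", "female"],
        "age": [29.0, None, 24.0, 35.0],
        "survived": [1, 0, 0, 1],
    })

    # Transformation: fill missing ages with the median
    df["age"] = df["age"].fillna(df["age"].median())

    # Aggregation: survival rate by class and sex
    rates = df.groupby(["pclass", "sex"])["survived"].mean()

    # Join: merge against a small lookup table
    classes = pd.DataFrame({"pclass": [1, 2, 3],
                            "label": ["first", "second", "third"]})
    joined = df.merge(classes, on="pclass", how="left")

    print(rates)
    print(joined.head())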
The Art of Social Media Analysis with Twitter & Python (Krishna Sankar)
The document discusses analyzing social networks and Twitter data using Python. It provides an introduction to analyzing the Twitter network of the user @clouderati, including 2072 followers. The presentation will cover topics like mentions, hashtags, retweets, and constructing a social graph to analyze cliques and networks. It also provides some tips for working with Twitter APIs and building scalable social media analysis pipelines in Python.
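A small pure-Python sketch of the mention, hashtag, and retweet counting described above; the tweet texts are invented, and a real pipeline would pull them from the Twitter API:

    import re
    from collections import Counter

    # Invented sample tweets; a real pipeline would fetch these via the Twitter API
    tweets = [
        "RT @clouderati: #BigData is eating the world @someuser",
        "Great #Spark talk by @clouderati at the summit #BigData",
    ]

    mentions = Counter(m for t in tweets for m in re.findall(r"@\w+", t))
    hashtags = Counter(h for t in tweets for h in re.findall(r"#\w+", t))
    retweets = sum(1 for t in tweets if t.startswith("RT "))

    print(mentions.most_common(3))
    print(hashtags.most_common(3))
    print("retweets:", retweets)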
This document discusses various heuristics and principles for architecture design. It provides guidelines for creating simplified, evolvable systems using small modular components. Some key points discussed include using open architectures, building in options, and designing structures that are resilient to stress. The document also advocates for pattern-oriented, minimalist designs and evolutionary systems that can adapt over time without disrupting existing information. Overall, the document presents best practices for handling complexity, enabling flexibility, and ensuring architectures can withstand failures.
The document discusses the future of data science, including increased use of functional programming, cloud notebooks, and probabilistic modeling of large and diverse datasets from IoT devices, drones, and satellites. It also predicts data scientists will displace traditional product managers as data becomes more important for decision making. Overall, the future involves analyzing exponentially larger volumes of diverse data using scalable cloud tools and probabilistic algorithms.
Towards a rebirth of data science (by Data Fellas) (Andy Petrella)
Nowadays, Data Science is buzzing all over the place.
But what is a so-called Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we are entering an age where velocity is key. Not only is the velocity of your data high, but time to market is also shortened. Hence, the time separating the moment you receive a set of data and the moment you are able to deliver added value is crucial.
In this talk, we’ll review the legacy Data Science methodologies and what they meant in terms of delivered work and results.
Afterwards, we’ll move on to the different concepts, techniques, and tools that Data Scientists will have to learn and adopt in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view on a solution to these challenges, especially through the Spark Notebook and the Shar3 product we develop.
An excursion into Text Analytics with Apache Spark (Krishna Sankar)
This document summarizes text analytics of US presidential primary debates and national mood. It discusses analyzing the mood of the nation from texts like State of the Union addresses. It then discusses analyzing the 2016 US presidential primary debates. The document outlines the text analytics pipeline including downloading data, preprocessing like removing stop words, feature extraction using TF-IDF, and analytics like logistic regression, LDA, and deep learning models. It provides references for further information on text analytics architectures and Spark capabilities for text analysis.
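A compact scikit-learn sketch of the pipeline's preprocessing, TF-IDF feature extraction, and logistic regression steps; the snippets and labels are invented stand-ins for the debate transcripts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Invented stand-ins for debate snippets, labeled 0/1 by speaker group
    texts = ["cut taxes and grow jobs", "invest in healthcare for all",
             "secure the border now", "expand access to education"]
    labels = [0, 1, 0, 1]

    # Stop-word removal and TF-IDF feature extraction in one step
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(texts)

    # Train a classifier and score a new snippet
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(vec.transform(["jobs and taxes"])))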
A New Year in Data Science: ML Unpaused (Paco Nathan)
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
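A toy sketch of the indicator idea; Dunning's paper scores co-occurrence significance with a log-likelihood ratio test, which this sketch replaces with a simple count threshold, and a real system would put the indicator fields in a search engine:

    from collections import Counter
    from itertools import combinations

    # Toy user -> liked-items data
    likes = {"u1": {"A", "B", "C"}, "u2": {"A", "B"}, "u3": {"B", "C"}}

    # 1. Count item co-occurrences across users
    cooc = Counter()
    for items in likes.values():
        for a, b in combinations(sorted(items), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    # 2. Keep "significant" co-occurrences (count threshold stands in
    #    for the log-likelihood ratio test) as indicator fields
    indicators = {}
    for (a, b), n in cooc.items():
        if n >= 2:
            indicators.setdefault(a, set()).add(b)

    # 3. Recommend by searching the indicators of a user's liked items
    def recommend(user):
        liked = likes[user]
        return {r for i in liked for r in indicators.get(i, set())} - liked

    print(recommend("u2"))  # {'C'}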
PyData 2015 Keynote: "A Systems View of Machine Learning" (Joshua Bloom)
Despite the growing abundance of powerful tools, building and deploying machine-learning frameworks into production continues to be a major challenge, in both science and industry. I'll present some particular pain points and cautions for practitioners as well as recent work addressing some of the nagging issues. I advocate for a systems view, which, when expanded beyond the algorithms and codes to the organizational ecosystem, places some interesting constraints on the teams tasked with development and stewardship of ML products.
About: Dr. Joshua Bloom is an astronomy professor at the University of California, Berkeley where he teaches high-energy astrophysics and Python for data scientists. He has published over 250 refereed articles largely on time-domain transient events and telescope/insight automation. His book on gamma-ray bursts, a technical introduction for physical scientists, was published recently by Princeton University Press. He is also co-founder and CTO of wise.io, a startup based in Berkeley. Josh has been awarded the Pierce Prize from the American Astronomical Society; he is also a former Sloan Fellow, Junior Fellow of the Harvard Society of Fellows, and Hertz Foundation Fellow. He holds a PhD from Caltech and degrees from Harvard and Cambridge University.
GalvanizeU Seattle: Eleven Almost-Truisms About Data (Paco Nathan)
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data to make it "analyzable". This process of data curation is called "Data Wrangling".
This task can be very cumbersome and requires trained personnel. However, with advances in open-source and commercial tooling, the process has gotten a lot easier and the technical expertise required to do it effectively has dropped several notches.
In this tutorial, we will get a feel for what data wranglers do, using R, RStudio, Trifacta Wrangler, and OpenRefine, with some hands-on exercises available at http://akuntamukkala.blogspot.com/2016/05/data-wrangling-examples.html
This document summarizes a presentation on machine learning and Hadoop. It discusses the current state and future directions of machine learning on Hadoop platforms. In industrial machine learning, well-defined objectives are rare, predictive accuracy has limits, and systems must precede algorithms. Currently, Hadoop is used for data preparation, feature engineering, and some model fitting. Tools include Pig, Hive, Mahout, and new interfaces like Spark. The future includes YARN for running diverse jobs and improved machine learning libraries. The document calls for academic work on feature engineering languages and broader model selection ontologies.
Big Data is changing abruptly, and where it is likely heading (Paco Nathan)
Big Data technologies are changing rapidly due to shifts in hardware, data types, and software frameworks. Incumbent Big Data technologies do not fully leverage newer hardware like multicore processors and large memory spaces, while newer open source projects like Spark have emerged to better utilize these resources. Containers, clouds, functional programming, databases, approximations, and notebooks represent significant trends in how Big Data is managed and analyzed at large scale.
Jupyter for Education: Beyond Gutenberg and Erasmus (Paco Nathan)
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
What is a distributed data science pipeline? How, with Apache Spark and friends (Andy Petrella)
What was a data product before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, like the Spark Notebook, Spark, Scala, Mesos, and so on, in an accompanying framework
This is a talk I gave at Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust... (Sri Ambati)
Spark is a distributed computing framework that can handle large scale data processing. The Spark notebook provides an interactive environment for working with Spark. ADAM is a data format and API for genomic data on Spark that optimizes for large datasets. Sparkling Water integrates H2O machine learning with Spark to enable techniques like deep learning on genomic data in a distributed manner using the Spark notebook. Data scientists and developers can collaborate using these tools to access, manipulate, and analyze massive genomic datasets.
Data Enthusiasts London: Scalable and Interoperable data services. Applied to... (Andy Petrella)
Data science requires many skills, many people, and much time before the results can be accessed. Moreover, these results can no longer be static. And finally, Big Data enters the picture and the whole tool chain needs to change.
In this talk Data Fellas introduces Shar3, a toolkit aiming to bridge the gaps in building an interactive distributed data processing pipeline, or loop!
The talk then covers current problems in genomics, including data types, processing, and discovery, by introducing the GA4GH initiative and its implementation using Shar3.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
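A minimal mrjob word-count job of the kind the document describes; the input file name is a placeholder:

    from mrjob.job import MRJob

    # Run locally:             python wordcount.py input.txt
    # Run on a Hadoop cluster: python wordcount.py -r hadoop input.txt
    class MRWordCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()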
Use of standards and related issues in predictive analytics (Paco Nathan)
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The document outlines an agenda for a conference on Apache Spark and data science, including sessions on Spark's capabilities and direction, using DataFrames in PySpark, linear regression, text analysis, classification, clustering, and recommendation engines using Spark MLlib. Breakout sessions are scheduled between many of the technical sessions to allow for hands-on work and discussion.
Advanced Data Science on Spark (Reza Zadeh, Stanford) (Spark Summit)
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
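A short PySpark sketch of the RDD-plus-MLlib pattern the deck describes, using k-means on toy 2-D points in local mode:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local", "mllib-sketch")

    # An RDD of 2-D points; in a real job this would be partitioned across a cluster
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])

    # MLlib algorithms consume RDDs directly
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    print(model.predict([0.05, 0.1]))  # cluster index for a new point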
This document outlines steps for developing analytic applications using Apache Spark and Python. It covers prerequisites for accessing flight and weather data, deploying a simple data pipe tool to build training, test, and blind datasets, and using an IPython notebook to train predictive models on flight delay data. The agenda includes accessing necessary services on Bluemix, preparing the data, training models in the notebook, evaluating model accuracy, and deploying models.
This document provides an introduction to Apache Spark, including its core components, architecture, and programming model. Some key points:
- Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure, which are immutable distributed collections that allow in-memory computing across a cluster.
- RDDs support transformations like map, filter, and reduce, and actions like collect that return results. Transformations are lazy while actions trigger computation (see the sketch after this list).
- Spark's execution model involves a driver program that coordinates tasks on worker nodes using an optimized scheduler.
- Spark SQL, MLlib, GraphX, and Spark Streaming extend the core Spark API for structured data, machine learning, graph processing, and stream processing.
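A minimal PySpark sketch of the lazy-transformation versus action distinction noted in the list above:

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-basics")
    rdd = sc.parallelize(range(10))

    # Transformations are lazy: nothing executes yet
    evens = rdd.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Actions trigger the actual computation
    print(squares.collect())                   # [0, 4, 16, 36, 64]
    print(squares.reduce(lambda a, b: a + b))  # 120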
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
- The document discusses a presentation given by Jongwook Woo on introducing Spark and its uses for big data analysis. It includes information on Woo's background and experience with big data, an overview of Spark and its components like RDDs and task scheduling, and examples of using Spark for different types of data analysis and use cases.
This document provides an overview of a machine learning workshop including tutorials on decision tree classification for flight delays, clustering news articles with k-means clustering, and collaborative filtering for movie recommendations using Spark. The tutorials demonstrate loading and preparing data, training models, evaluating performance, and making predictions or recommendations. They use Spark MLlib and are run in Apache Zeppelin notebooks.
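A toy version of the flight-delay decision tree using Spark MLlib; the feature values and labels are invented for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    sc = SparkContext("local", "flight-delays")

    # Invented records: features = [departure_hour, distance_miles],
    # label = 1.0 if the flight was delayed
    data = sc.parallelize([
        LabeledPoint(1.0, [18.0, 2475.0]),
        LabeledPoint(0.0, [9.0, 300.0]),
        LabeledPoint(1.0, [17.0, 1100.0]),
        LabeledPoint(0.0, [7.0, 650.0]),
    ])

    model = DecisionTree.trainClassifier(
        data, numClasses=2, categoricalFeaturesInfo={},
        impurity="gini", maxDepth=3)

    print(model.toDebugString())
    print(model.predict([16.0, 2000.0]))  # prediction for a new flight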
This document provides an overview of Scala and compares it to Java. It discusses Scala's object-oriented and functional capabilities, how it compiles to JVM bytecode, and benefits like less boilerplate code and support for functional programming. Examples are given of implementing a simple Property class in both Java and Scala to illustrate concepts like case classes, immutable fields, and less lines of code in Scala. The document also touches on Java interoperability, learning Scala gradually, XML processing capabilities, testing frameworks, and tool/library support.
An Introduction to Spark - Atlanta Spark Meetup (jlacefie)
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLlib provides scalable machine learning. Spark Streaming processes live data streams.
Scala presentation by Aleksandar Prokopec (Loïc Descotte)
This document provides an introduction to the Scala programming language. It discusses how Scala runs on the Java Virtual Machine, supports both object-oriented and functional programming paradigms, and provides features like pattern matching, immutable data structures, lazy evaluation, and parallel collections. Scala aims to be concise, expressive, and extensible.
Here are the answers to your questions:
1. The main differences between a Trait and an Abstract Class in Scala are:
- Traits can be mixed into classes using with, while Abstract Classes can only be extended.
- A class can mix in multiple Traits but extend only one Abstract Class, so Traits enable a form of multiple inheritance.
- Abstract Classes can have constructor parameters, while Traits cannot (in Scala 2).
- A Trait that extends a class restricts which classes can mix it in, whereas an Abstract Class simply extends its parent.
2. abstract class Animal {
     def isMammal: Boolean              // abstract member: subclasses must implement
     def isFriendly: Boolean = true     // concrete member with a default
     def summarize: Unit = {
       println("Characteristics of animal:")
       println(s"Mammal: $isMammal, friendly: $isFriendly")
     }
   }
Oracle Database 12c includes over 500 new features. Some key new features include:
- Oracle Database 12c Express (EM Express), which replaces Database Control; it has fewer features than Database Control but does not require Java or an app server.
- New online capabilities like online DDL operations with no DDL locking, online move of partitions with no impact to queries, and online statistics gathering for bulk loads.
- Adaptive SQL Plan Management which allows the optimizer to select a more optimal plan at execution time based on current statistics.
- Multitenant architecture which allows consolidation of multiple databases into one container database with pluggable databases.
The document summarizes new features in Oracle Database 12c, relative to 11g, that would help a DBA currently using 11g. It lists and briefly describes features such as the READ privilege, temporary undo, online data file move, DDL logging, and many others. The objectives are to make the DBA aware of useful 12c features when working with a 12c database and to discuss each feature at a high level within 90 seconds.
Oracle12 - The Top12 Features by NAYA Technologies (NAYATech)
The document discusses the top 12 new features of Oracle 12c, as presented by David Yahalom of NAYA Technologies. It covers improved column defaults, increased size limits, improved top-N queries, temporary UNDO, new partitioning features, transaction guard, adaptive execution plans, enhanced statistics, data optimization and information lifecycle management (ILM), row pattern matching, and a 50% discount code for an Oracle performance tuning seminar offered by NAYA Technologies.
The document provides an agenda for a DevOps advanced class on Spark being held in June 2015. The class will cover topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, Spark SQL, PySpark, and Spark Streaming. It will include labs on DevOps 101 and 102. The instructor has over 5 years of experience providing Big Data consulting and training, including over 100 classes taught.
This document introduces Spark SQL 1.3.0 and how to use it efficiently. It discusses the main entry points like SQLContext, shows how to create DataFrames from RDDs and JSON, and covers operations like select, filter, groupBy, and join, as well as saving data. It shows how to register DataFrames as tables and write SQL queries. DataFrames also support RDD actions and transformations. The document provides references for learning more about DataFrames and their development direction.
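A short PySpark sketch of these DataFrame operations; it uses the modern SparkSession entry point rather than the 1.3-era SQLContext the deck covers, but the operations are the same:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("alice", "eng", 100), ("bob", "eng", 80), ("carol", "ops", 90)],
        ["name", "dept", "score"])

    df.filter(df.score > 85).select("name", "dept").show()
    df.groupBy("dept").avg("score").show()

    # Register as a table and query with SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT dept, COUNT(*) AS n FROM people GROUP BY dept").show()

    # Save the data out as JSON
    df.write.mode("overwrite").json("/tmp/people-json")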
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
Big data refers to data processing that scales to large data sets and is processed in a distributed, fault-tolerant manner and stored so it can be accessed from anywhere. While the size of data is important, big data is more about how the data is processed and accessed rather than strictly defining it by size. Hadoop has emerged as a common platform for working with big data by providing an affordable and functional system that can handle enormous scales of data.
Data Science at Scale - The DevOps Approach (Mihai Criveti)
DevOps Practices for Data Scientists and Engineers
1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
Rental Cars and Industrialized Learning to Rank with Sean Downes (Databricks)
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
The document provides an overview of the Spark framework for lightning fast cluster computing. It discusses how Spark addresses limitations of MapReduce-based systems like Hadoop by enabling interactive queries and iterative jobs through caching data in-memory across clusters. Spark allows loading datasets into memory and querying them repeatedly for interactive analysis. The document covers Spark's architecture, use of resilient distributed datasets (RDDs), and how it provides a unified programming model for batch, streaming, and interactive workloads.
Agile Data: Building Hadoop Analytics Applications (DataWorks Summit)
This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
This document summarizes challenges in assembling large DNA sequence data sets and strategies to address them.
1. The cost to generate DNA sequence data is decreasing rapidly, creating data sets too large for most computers to assemble. Hundreds to thousands of such data sets are generated each year.
2. Techniques like streaming compression and low-memory probabilistic data structures allow assembly memory usage to scale linearly with the sample size rather than the total data, enabling assembly of larger datasets.
3. Benchmarking different computational platforms revealed that while some platforms have faster processors, the ability to store large amounts of data locally is also important for assembly tasks. Scaling algorithms, rather than just optimizing code, is key to addressing these challenges.
Creating a Data Science Team from an Architect's perspective. This is about team building: how to support a data science team with the right staff, including data engineers and DevOps.
This document outlines a data science competition to build a spam detector using email data. Participants will be provided with training data containing 600 emails and their corresponding labels (spam or not spam). They will use this data to build a model to classify new emails as spam or not spam. The goal is to correctly classify as many new test emails as possible. Visualization and interpretation of results will be important for evaluating model performance and identifying ways to improve the spam detection.
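A baseline sketch of such a spam detector with scikit-learn; the toy emails below stand in for the 600 labeled training emails:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the competition's labeled training emails (1 = spam)
    emails = ["win a free prize now", "meeting agenda for monday",
              "free money click here", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]

    # Bag-of-words features feeding a naive Bayes classifier
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(emails, labels)

    print(model.predict(["free prize inside"]))  # expected: [1]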
This document discusses the role of data scientists and big data. It begins by providing an overview of what a data scientist does, including asking good questions about data, exploring and modeling data, and creating data products that provide business insights. It then discusses key aspects of big data, such as how it forces changes to how data is collected, stored, and analyzed. The document emphasizes that big data alone is not useful and must be "refined" through data science techniques to extract meaningful insights.
Agile Data Science by Russell Jurney, The Hive, January 29 2014 (The Hive)
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... (BigDataEverywhere)
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
This document provides information about an introductory workshop on computer programming and Python. It includes topics like what programming is, programming languages and their expressive power, Python as the language that will be used, and assignments for students to complete including printing text, introducing themselves, iterating through lists, and working with data types and objects.
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit... (Databricks)
We’ve all heard that AI is going to become as ubiquitous in the enterprise as the telephone, but what does that mean exactly?
Everyone in IBM has a telephone; and everyone knows how to use her telephone; and yet IBM isn’t a phone company. How do we bring AI to the same standard of ubiquity — where everyone in a company has access to AI and knows how to use AI; and yet the company is not an AI company?
In this talk, we’ll break down the challenges a domain expert faces today in applying AI to real-world problems. We’ll talk about the challenges that a domain expert needs to overcome in order to go from “I know a model of this type exists” to “I can tell an application developer how to apply this model to my domain.”
We’ll conclude the talk with a live demo that showcases how a domain expert can cut through the five stages of model deployment in minutes instead of days using IBM and other open source tools.
Agile Data Science: Building Hadoop Analytics Applications (Russell Jurney)
This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format, to reduce joins and better represent semi-structured data.
Big data and artificial intelligence have developed through an iterative process where increased data leads to improved infrastructure which then enables the collection of even more data. This virtuous cycle began with the rise of the internet and web data in the 1990s. Modern frameworks like Hadoop and algorithms like MapReduce established the infrastructure needed to analyze large, distributed datasets and fuel machine learning applications. Deep learning techniques are now widely used for tasks involving images, text, video and other complex data types, with many companies seeking to gain advantages by leveraging proprietary datasets.
Agile Data Science: Hadoop Analytics Applications (Russell Jurney)
This document provides instructions and examples for analyzing and visualizing event data in an agile manner. It discusses loading event data stored in Avro format using tools like Pig and displaying the data in a browser. Specific steps outlined include using Cat to view Avro data, loading the data into Pig and using Illustrate to view sample records. The overall approach emphasized is to work with atomic event data in an iterative way using Pig and other Hadoop tools to explore and visualize the data.
Here are a few suggestions to help with the code challenge:
1. To convert the socket server to a Kafka producer, use Spark Structured Streaming's writeStream to write to a Kafka topic instead of the console output (see the sketch after these suggestions). You'll need Kafka and ZooKeeper running.
2. The Satori streaming data source provides simpler JSON data than the raw weather data. You could read from a Satori stream instead of the socket.
3. To parse the weather data, first define a case class to hold the fields. Then use from_json to parse each row into the case class. You may need to handle errors/missing fields.
4. Once the weather data is in a DataFrame, you can perform further transformations and aggregations on it.
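A minimal Structured Streaming sketch for suggestion 1 above; the host, port, Kafka settings, and weather schema fields are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("weather-to-kafka").getOrCreate()

    # Placeholder schema for the incoming weather JSON
    schema = StructType([
        StructField("station", StringType()),
        StructField("temperature", DoubleType()),
    ])

    # Read lines from the socket server (host/port are placeholders)
    raw = (spark.readStream.format("socket")
           .option("host", "localhost").option("port", 9999).load())

    # Parse each line into typed columns
    weather = raw.select(from_json(col("value"), schema).alias("w")).select("w.*")

    # The Kafka sink expects a string 'value' column, so re-serialize to JSON
    query = (weather.selectExpr("to_json(struct(*)) AS value")
             .writeStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "weather")
             .option("checkpointLocation", "/tmp/weather-ckpt")
             .start())
    query.awaitTermination()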
Software Carpentry and the Hydrological Sciences @ AGU 2013 (Aron Ahmadia)
This document discusses bringing computational skills training to hydrologists through Software Carpentry workshops. It notes that while many hydrologists are focused on their research, computational methods are now essential. Software Carpentry teaches practical skills like the Unix shell, version control with Git, Python and R programming, and databases. These intensive, short workshops have been effective at training graduate students. The document encourages hydrologists to host their own workshops and support computational literacy by discussing code and practices in their papers.
Software Carpentry for the Geophysical Sciences (Aron Ahmadia)
This document summarizes a presentation given by Aron Ahmadia at the ESIP Winter Meeting in January 2014 on Software Carpentry for the Geophysical Sciences. The presentation discussed how most scientists do not have strong computational skills and rely on outdated tools. It introduced Software Carpentry, which teaches practical computational skills like the Unix shell, version control with Git, and programming in Python and R. These skills can help scientists more effectively manage, share, and validate their work. The presentation encouraged scientists to get involved by attending or hosting Software Carpentry workshops, and contributing teaching materials relevant to earth sciences.
Similar to Data Science with Spark - Training at SparkSummit (East)
Data Wrangling For Kaggle Data Science Competitions (Krishna Sankar)
This document outlines an agenda for a tutorial on data wrangling for Kaggle data science competitions. The tutorial covers the anatomy of a Kaggle competition, algorithms for amateur data scientists, model evaluation and interpretation, and hands-on sessions for three sample competitions: Titanic, Data Science London, and PAKDD 2014. The goals are to familiarize participants with competition mechanics, explore algorithms and the data science process, and have participants submit entries for three competitions by applying algorithms like CART, random forests, and SVMs to Kaggle datasets.
Bayes' theorem provides a way to calculate conditional probabilities based on observed data. It establishes the relationship between the conditional probability of an event (A), given that another event (B) is known to have occurred, and the inverse conditional probability. Bayes' theorem allows inference of probabilities and is useful for classification problems using Bayesian classifiers. It plays an important role in statistical inference and predictive modeling.
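For reference, a minimal statement of the theorem: for events A and B with P(B) > 0,

    P(A | B) = P(B | A) * P(A) / P(B)

where P(A) is the prior probability of A and P(A | B) is the posterior probability after observing B.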
Notes about Amazon VPC, a canonical architecture, and finally how to implement MongoDB replica sets. My blog http://goo.gl/0guF2 has the color pictures. And the file is at http://doubleclix.files.wordpress.com/2012/10/vpc-distilled-04.pdf. For some reason, SlideShare trims the colors.
This document provides an overview of scrum, an agile framework for managing software development projects. It defines key scrum concepts like user stories, sprints, daily standups, sprint reviews and retrospectives. The scrum process is explained through 8 steps: release planning, decomposing features to user stories, sprint planning, the 2-week sprint, daily standups, sprint review, sprint retrospective, and adjusting the backlog for the next sprint. Scrum aims to deliver working software frequently through short iterative cycles, continuous improvement and close business/developer collaboration.
The document discusses the concept of big data and provides some key insights:
1) Big data refers to large data sets that cannot be processed by traditional software and hardware. It is characterized by high volume, velocity, and variety of data.
2) Examples of big data sources include social media firehoses from Twitter and large customer transaction logs from retailers.
3) Analyzing big data requires specialized architectures and techniques to handle the scale, speed, and complexity of the data. Common approaches involve distributed file systems and map-reduce processing frameworks like Hadoop.
4) The goal of big data analytics is to extract useful insights, trends, and patterns from these large and diverse data sets to help improve business decisions.
The document summarizes precision time synchronization techniques. It begins with an overview of time synchronization and its applications in fields like industrial automation, stock trading, and cloud computing. It then provides details on IEEE 1588, including its objectives to achieve sub-microsecond synchronization across networked devices and support for heterogeneous clock systems. The document discusses PTP communication ports, roles, and the Best Master Clock Algorithm for determining roles. It also outlines PTP message types and how hardware-assisted time stamping increases accuracy. Lastly, it promotes participation in the 2012 International Symposium on Precision Clock Synchronization for Measurement, Control and Communication.
1) The document provides instructions for setting up an AWS account and launching an EC2 instance with an AMI that contains tools and documentation for a hands-on tutorial on NoSQL databases and MongoDB.
2) The tutorial covers basic MongoDB commands and demonstrates how to create, insert, update, and query document data using the mongo shell client (sketched below in Python). Embedded and nested documents are explored along with geospatial queries.
3) A map-reduce example aggregates historical check-in data to calculate popular locations over different time periods, demonstrating how MongoDB supports batch operations.
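A minimal pymongo sketch of the commands the tutorial covers, expressed in Python rather than the mongo shell; the database and collection names are hypothetical:

    from pymongo import MongoClient, GEO2D

    client = MongoClient("localhost", 27017)
    checkins = client.tutorial.checkins  # hypothetical database/collection

    # Insert a document with an embedded sub-document
    checkins.insert_one({
        "user": "alice",
        "venue": {"name": "Cafe X", "city": "Boston"},
        "loc": [-71.06, 42.36],  # [longitude, latitude]
    })

    # Update and query, including a nested field
    checkins.update_one({"user": "alice"}, {"$set": {"visits": 1}})
    print(checkins.find_one({"venue.city": "Boston"}))

    # Geospatial query requires a 2d index on the location field
    checkins.create_index([("loc", GEO2D)])
    for doc in checkins.find({"loc": {"$near": [-71.06, 42.36]}}).limit(5):
        print(doc)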
The document summarizes a demonstration of interoperability between cloud standards CDMI and OCCI. It includes an agenda, overview of the demo architecture showing how CDMI and OCCI interact across storage and compute clouds, and summaries of CDMI, OCCI, the demo topology, and lessons learned from the JavaFX client demo. The goal is to take a first step toward cloud interoperability by showing how CDMI storage primitives and OCCI can work across different cloud types and components.
My talk on NoSQL at OGF29. [Updated with the OSCON'10 presentation!] But updates do not work reliably on SlideShare, so I also keep the latest version on my blog.
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science in the 2010s, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.
This presentation is about health care analysis using sentiment analysis.
* This is very useful to students who are doing a project on sentiment analysis.
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases (Timothy Spann)
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases: in which cases you need one, and in which you probably don’t. I will also go over similarity search, where you get vectors from, and an example of a vector database architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Do People Really Know Their Fertility Intentions? Correspondence between Sel... - Xiao Xu
Fertility intention data from surveys often serve as a crucial component in modeling fertility behaviors. Yet, the persistent gap between stated intentions and actual fertility decisions, coupled with the prevalence of uncertain responses, has cast doubt on the overall utility of intentions and sparked controversies about their nature. In this study, we use survey data from a representative sample of Dutch women. With the help of open-ended questions (OEQs) on fertility and Natural Language Processing (NLP) methods, we are able to conduct an in-depth analysis of fertility narratives. Specifically, we annotate the (expert) perceived fertility intentions of respondents and compare them to their self-reported intentions from the survey. Through this analysis, we aim to reveal the disparities between self-reported intentions and the narratives. Furthermore, by applying neural topic modeling methods, we could uncover which topics and characteristics are more prevalent among respondents who exhibit a significant discrepancy between their stated intentions and their probable future behavior, as reflected in their narratives.
Data Science with Spark - Training at SparkSummit (East)
1. Data Science Training
Spark
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
"If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions!
------ We are not going to hang data by it’s legs !!
http://training.databricks.com/workshop/datasci.pdf
2. ADVANCED: DATA SCIENCE WITH APACHE SPARK
Data Science applications with Apache Spark combine the scalability of Spark with
distributed machine learning algorithms.
This material expands on the “Intro to Apache Spark” workshop. Lessons focus on
industry use cases for machine learning at scale, coding examples based on public
data sets, and leveraging cloud-based notebooks within a team context. Includes
limited free accounts on Databricks Cloud.
Topics covered include:
Data transformation techniques based on both Spark SQL and functional
programming in Scala and Python.
Predictive analytics based on MLlib, clustering with KMeans, building classifiers
with a variety of algorithms and text analytics – all with emphasis on an
iterative cycle of feature engineering, modeling, evaluation.
Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
Understand how primitives like Matrix Factorization are implemented in a
distributed parallel framework, from the designers of MLlib
Several hands-on exercises using datasets such as Movielens, Titanic, State Of
the Union speeches, and RecSys Challenge 2015.
Prerequisites:
Intro to Apache Spark workshop or
equivalent (e.g., Spark Developer Certificate)
Experience coding in Scala, Python, SQL
Have some familiarity with Data Science
topics (e.g., business use cases)
4. Goals
• Patterns: Data wrangling (Transform, Model & Reason) with Spark
o Use RDDs, Transformations and Actions in the context of a Data Science Problem, an Algorithm & a Dataset
• Spend time working through MLlib
• Balance between internals & hands-on
o Internals from Reza, the MLlib lead
• ~65% of time on Databricks Cloud & Notebooks
o Take the time to get familiar with the Interface & the Data Science Cloud
o Make mistakes, experiment, …
• Good Time for this course, this version
o Will miss many of the gory details as the framework evolves
• Summarized materials for a 3 day course
o Even if we don’t finish the exercises today, that is fine
o Complete the work at home - There are also homework notebooks
o Ask us questions @ksankar, @pacoid, @reza_zadeh, @mhfalaki, @andykonwinski, @xmeng, @michaelarmbrust, @tathadas
5. Tutorial Outline:
Morning:
o Welcome + Getting Started (Krishna)
o Databricks Cloud mechanics (Andy)
o Ex 0: Pre-Flight Check (Krishna)
o Data Science DevOps - Introduction to Spark (Krishna)
o Ex 1: MLlib: Statistics, Linear Regression (Krishna)
o MLlib Deep Dive – Lecture (Reza)
o Design Philosophy, APIs
o Ex 2: In which we explore Disasters, Trees, Classification & the Kaggle Competition (Krishna)
o Random Forest, Bagging, Data De-correlation
Afternoon:
o Ex 3: Clustering - In which we explore Segmenting Frequent InterGalactic Hoppers (Krishna)
o Ex 4: Recommendation (Krishna)
o Theory: Matrix Factorization, SVD, … (Reza)
o On-line k-means, spark streaming (Reza)
o Ex 5: Mood of the Union - Text Analytics (Krishna)
o In which we analyze the Mood of the nation from inferences on SOTU by the POTUS (State of the Union Addresses by The President Of the US)
o Deep dive - Leverage parallelism of RDDs, sparse vectors, etc. (Reza)
o Ex 99: RecSys 2015 Challenge (Krishna)
o Ask Us Anything - Panel
7. About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata,
Strata et al
o Reviewer “Machine Learning with Spark”
o Picked up co-authorship Second Edition of “Fast
Data Processing with Spark”
o Have done lots of things:
o @ksankar, doubleclix.wordpress.com ksankar42@gmail.com
The Nuthead band!
10. Everyone will receive a username/password for one of the Databricks Cloud
shards. Use your laptop and browser to login there.
We find that cloud-based notebooks are a simple way to get started using
Apache Spark – as the motto “Making Big Data Simple” states.
Please create and run a variety of notebooks on your account throughout the
tutorial. These accounts will remain open long enough for you to export your
work.
See the product page or FAQ for more details, or contact Databricks to register
for a trial account.
Getting Started: Step 1
19.
Then create a clone of this folder in the folder that you just created:
Getting Started: Step 9
20.
Now let’s get started with the coding exercise!
We’ll define an initial Spark app in three lines of code:
Click on _00.pre-flight-check
Getting Started: Coding Exercise
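The pre-flight notebook itself isn't reproduced in this deck, but a minimal sketch of a three-line Spark app of the kind referenced above might look like this in Python (assumes the SparkContext `sc` that Databricks Cloud notebooks provide):

data = sc.parallelize(range(1000))           # distribute a local collection as an RDD
squares = data.map(lambda x: x * x)          # lazy transformation
print(squares.reduce(lambda a, b: a + b))    # action: triggers the actual computation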
23. Data Science :
The art of building a model with known knowns, which when let loose,
works with unknown unknowns!
http://smartorg.com/2013/07/valuepoint19/
[Quadrant diagram: the World (known/unknown) vs. You (known/unknown)]
o Others know, you don’t
o What we do
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, probabilities
24. The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data
Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
25. Data Science - Context
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
Collect Store Transform
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
! Data Subsets
! Attribute sets
o Refine model with
! Extended Data subsets
! Engineered Attribute sets
o Validation run across a larger data set
Reason Model Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2-way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
Explore Visualize Recommend Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
" Bytes to Business a.k.a. Build the full stack
" Find Relevant Data For Business
" Connect the Dots
26. Volume
Velocity
Variety
Data Science - Context
Context
Connectedness
Intelligence
Interface
Inference
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
27. Day in the life of a (super) Model
Intelligence
Inference
Data Representation
Interface
Algorithms
Parameters, Attributes
Data (Scoring)
Model Selection
Reason & Learn
Models
Visualize, Recommend, Explore
Model Assessment
Feature Selection, Dimensionality Reduction
30. RDD – The workhorse of Spark
o Resilient Distributed Datasets
o Transformations – create RDDs
o Actions – Get values
o We will apply these operations during this tutorial (see the sketch below)
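As a minimal sketch of these operations (the file path is hypothetical; assumes a live SparkContext `sc`):

lines = sc.textFile("data/cars.csv")                   # create an RDD from a text file
gas = lines.filter(lambda l: "gas" in l)               # transformation: lazily builds a new RDD
pairs = gas.map(lambda l: (l.split(",")[0], 1))        # transformation: key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b)         # transformation: aggregate per key
print(counts.take(5))                                  # action: materializes the result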
33. Session – 1 : MLlib - Statistics & Linear Regression
1. Notebook: 01_StatsLR-1
① Read car data
② Stats (Guided)
③ Correlation (Guided)
④ Coding Exercise-21-Template (Correlation)
2. Notebook: 02_StatsLR-2
① CE-21-Solution
3. Linear Regression
① LR (Guided)
② CE-22-Template (LR on Car Data)
4. Notebook: 03_StatsLR-3
① CE-22-Solution
② Explain
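For readers without the notebooks at hand, here is a hedged sketch of the Session 1 flow using the Spark 1.x MLlib RDD API (the car dataset's exact columns are not shown in this deck, so the parsing and column choices below are illustrative):

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

rows = sc.textFile("cars.csv").map(lambda l: [float(x) for x in l.split(",")])
vecs = rows.map(Vectors.dense)

stats = Statistics.colStats(vecs)                 # per-column summary statistics
print(stats.mean(), stats.variance())
print(Statistics.corr(vecs, method="pearson"))    # correlation matrix

points = rows.map(lambda r: LabeledPoint(r[0], r[1:]))   # regress column 0 on the rest
model = LinearRegressionWithSGD.train(points, iterations=100, step=0.01)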
37. Outline
Data flow vs. traditional network programming
Spark computing engine
Optimization Examples
Matrix Computations
MLlib + {Streaming, GraphX, SQL}
Future of MLlib
38. Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Run parts twice for fault recovery
Biggest example: MapReduce
[Diagram: several Map tasks feeding into Reduce tasks]
39. Spark Computing Engine
Extends a programming language with a
distributed collection data-structure
» “Resilient distributed datasets” (RDD)
Open source at Apache
» Most active community in big data, with 50+
companies contributing
Clean APIs in Java, Scala, Python
Community: SparkR, soon to be merged
40. Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, …)
» The world only lets you make RDDs such that they can be:
Automatically rebuilt on failure
41. MLlib History
MLlib is a Spark subproject providing machine
learning primitives
Initial contribution from AMPLab, UC Berkeley
Shipped with Spark since Sept 2013
42. MLlib: Available algorithms
classification: logistic regression, linear SVM,
naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs),
regression tree
collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
43. Optimization
At least two large classes of optimization
problems humans can solve:
» Convex
» Spectral
50. Spark PageRank
Given directed graph, compute node importance. Two RDDs:
» Neighbors (a sparse graph/matrix)
» Current guess (a vector)
Using cache(), keep neighbor list in RAM
51. Spark PageRank
Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing
[Diagram: Neighbors (id, edges) and Ranks (id, rank) RDDs co-partitioned via partitionBy, then repeatedly joined]
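A sketch of that pattern on a toy graph (the co-partitioning and caching are the point):

links = sc.parallelize([(1, [2, 3]), (2, [3]), (3, [1])]) \
          .partitionBy(4).cache()                  # neighbor lists stay in RAM, hashed once
ranks = links.mapValues(lambda _: 1.0)             # initial guess

for _ in range(10):
    contribs = links.join(ranks).flatMap(          # join is cheap: same partitioner
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())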
55. Distributing Matrices
How to distribute a matrix across machines?
» By Entries (CoordinateMatrix)
» By Rows (RowMatrix)
» By Blocks (BlockMatrix)
All of Linear Algebra to be rebuilt using these
partitioning schemes
As of version 1.3!
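A sketch of the three schemes using the Python API (these classes are exposed to Python in later releases; in Spark 1.3 they lived in the Scala API):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

rows = sc.parallelize([Vectors.dense(1, 2), Vectors.dense(3, 4)])
row_mat = RowMatrix(rows)                                # distributed by rows

entries = sc.parallelize([MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 4.0)])
coord_mat = CoordinateMatrix(entries)                    # distributed by entries

block_mat = coord_mat.toBlockMatrix(2, 2)                # distributed by blocks
product = block_mat.multiply(block_mat)                  # multiply needs a scheme-aware plan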
56. Distributing Matrices
Even the simplest operations require thinking about communication, e.g. multiplication
How many different matrix multiplies needed?
» At least one per pair of {Coordinate, Row, Block, LocalDense, LocalSparse} = 10
» More, because multiplication is not commutative
59. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
o Data alone is not enough
o Machine Learning is not magic – one cannot get something from nothing
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
60. Classification - Spark API
• Logistic Regression
• SVMWithSGD
• DecisionTrees
• Data as LabeledPoint (we will see in a moment)
• DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
• Impurity – “entropy” or “gini”
• maxBins = control to throttle communication at the expense of accuracy
o Larger = Higher Accuracy
o Smaller = less communication (as # of bins → number of instances)
• Data Adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
• Spark = Intelligent Framework - need this for scale
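The call from the slide, made concrete on toy data (the label/feature values are made up; 200 rows keep maxBins=100 below the instance count, per the note above):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

data = sc.parallelize(
    [LabeledPoint(float(i % 2), [float(i % 2), float(i)]) for i in range(200)])
model = DecisionTree.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    impurity="gini", maxDepth=4, maxBins=100)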
61. Lookout for these interesting Spark features
• Concept of LabeledPoint & how to create an RDD of LPs
• Print the tree
• Calculate Accuracy & MSE from RDDs
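Continuing from the decision-tree sketch above, the three items look roughly like this:

print(model.toDebugString())                             # print the tree

preds = model.predict(data.map(lambda p: p.features))    # predict over an RDD
labels_preds = data.map(lambda p: p.label).zip(preds)

accuracy = labels_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())
mse = labels_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()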
67. Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
o Get your head around data
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-science-talk-by-jeremy-howard/
68. Session-2 : Kaggle, Classification & Trees
1. Notebook: 04_Titanic-01
① Read Training Data
② Henry the Sixth Model
③ Submit to Kaggle
2. Notebook: 05_Titanic-02
① Decision Tree Model
② CE-31 Template
① Create RandomForest Model
② Predict Testset
③ Submit to Kaggle
3. Notebook: 06_Titanic-03
① CE-32 Solution
① RandomForest Model
② Predict Testset
③ Submit Solution 2
② Discussion about Models
69. Trees, Forests & Classification
• Discuss Random Forest
o Boosting, Bagging
o Data de-correlation
• Why it didn’t do better in the Titanic dataset
• Data Science Folk Wisdom
o http://www.slideshare.net/ksankar/data-science-folk-knowledge
71. Why didn’t RF do better ? Bias/Variance
o High Bias
o High Variance
[Learning curve: prediction error vs. training error. High bias: need more features or a more complex model to improve; high variance: need more data to improve]
Ref: Strata 2013 Tutorial by Olivier Grisel
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
http://www.slideshare.net/ksankar/data-science-folk-knowledge
72. Decision Tree – Best Practices
maxDepth: tune with data/model selection
maxBins: set low, monitor communications, increase if needed
# RDD partitions: set to # of cores
• Usually the recommendation is that RDD partitions should be over-partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power, and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
• Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
75. Random Forests
• While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
• Simpler because it requires only two variables – no. of Predictors (typically √k) & no. of trees (500 for large datasets, 150 for smaller)
• Error prediction
o For each iteration, predict for the dataset that is not in the sample (OOB data)
o Aggregate OOB predictions
o Calculate Prediction Error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
o We will see how close this is to the actual error in the Heritage Health Prize
• Assumes equal cost for mis-prediction. Can add a cost function
• Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
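A hedged sketch with MLlib's Random Forest (Spark 1.2+), showing the two knobs named above; featureSubsetStrategy="sqrt" corresponds to the √k rule, and `data` is an RDD of LabeledPoints as in the earlier tree example. Note the RDD-based API does not surface the OOB error itself, so a held-out sample is the usual substitute:

from pyspark.mllib.tree import RandomForest

rf = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=150,                     # 500 for large datasets, 150 for smaller
    featureSubsetStrategy="sqrt",     # √k predictors considered per split
    impurity="gini", maxDepth=4, maxBins=32)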
79. Clustering - Hands On :
• Normalization & Centering
• Clustering
• Optimizing k based on cohesiveness of the clusters (WSSE)
1:30
80. Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
o Learn many models, not Just One
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
! http://dl.acm.org/citation.cfm?id=2347755
81. Session-3 : Clustering
1. Notebook: 07_Cluster-1
① Read Data
② Cluster
③ Modeling Exercise-41-Template
2. Notebook: 08_Cluster-2
① ME-41-Solution
② Center and Scale
③ Cluster
④ Inspect centroid
⑤ CE-42-Template : Cluster Semantics
3. Notebook: 09_Cluster-3
① CE-42 Solution
② Cluster Semantics - Discussion
82. Clustering - Theory
• Clustering is unsupervised learning
• While the computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
• Two types:
o Centroid based clustering – k-means clustering
o Tree based Clustering – hierarchical clustering
• Spark implements the Scalable KMeans++
o Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
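A small sketch of MLlib k-means with the scalable k-means|| initialization, plus the WSSE-style cohesiveness measure used in the hands-on to pick k (toy points):

from pyspark.mllib.clustering import KMeans

vecs = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
model = KMeans.train(vecs, k=2, maxIterations=20,
                     initializationMode="k-means||")

def wsse(point):
    center = model.clusterCenters[model.predict(point)]
    return sum((p - c) ** 2 for p, c in zip(point, center))

print(vecs.map(wsse).reduce(lambda a, b: a + b))   # lower = more cohesive clusters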
83. Lookout for these interesting Spark features
• Application of Statistics toolbox
• Center & Scale RDD
• Filter RDDs
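“Center & Scale RDD” maps to MLlib's StandardScaler; a minimal sketch:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

vecs = sc.parallelize([Vectors.dense(1.0, 10.0), Vectors.dense(3.0, 30.0)])
scaler = StandardScaler(withMean=True, withStd=True).fit(vecs)
scaled = scaler.transform(vecs)        # zero mean, unit variance per column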
89. Session-4 : Recommendation at Scale
1. Notebook-10_Reco-1
① Read MovieLens medium data
② CE-51 Template – Partition Data
2. Notebook-11_Reco-2
① CE-51 Solution
② ALS Slide
③ Train ALS & Predict
④ Calculate Model Performance
90. Recommendation & Personalization - Spark
Automated Analytics - Let Data tell the story
Feature Learning, AI, Deep Learning
Learning Models - fit parameters as it gets more data
Dynamic Models – model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
o Item Based
o User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark (in 1.1.0) implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets; Yifan Hu, AT&T Labs, Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize; Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
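A minimal sketch of ALS on (user, item, rating) triples, in the spirit of the MovieLens notebooks (the ratings here are made up):

from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize(
    [Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)])
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

pairs = ratings.map(lambda r: (r.user, r.product))
preds = model.predictAll(pairs)        # join back to ratings to compute e.g. RMSE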
95. Singular Value Decomposition
Two cases
» Tall and Skinny
» Short and Fat (not really)
» Roughly Square
SVD method on RowMatrix takes care of
which one to call.
97. Tall and Skinny SVD
Gets us V and the singular values
Gets us U by one matrix multiplication
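A sketch of the tall-and-skinny case via RowMatrix.computeSVD (the Python binding appeared in later releases; in Spark 1.3 this was a Scala API):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(sc.parallelize(
    [Vectors.dense(1, 2), Vectors.dense(3, 4), Vectors.dense(5, 6)]))
svd = mat.computeSVD(2, computeU=True)
# svd.s: singular values; svd.V: small local matrix; svd.U: distributed RowMatrix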
98. Square SVD
ARPACK: Very mature Fortran77 package for computing eigenvalue decompositions
JNI interface available via netlib-java
Distributed using Spark – how?
99. Square SVD via ARPACK
Only interfaces with distributed matrix via
matrix-vector multiplies
The result of matrix-vector multiply is small.
The multiplication can be distributed.
100. Square SVD
With 68 executors and 8GB memory in each,
looking for the top 5 singular vectors
102. All pairs Similarity
All pairs of cosine scores between n vectors
» Don’t want to brute force (n choose 2) m
» Essentially computes
Compute via DIMSUM
» Dimension Independent Similarity
Computation using MapReduce
103. Intuition
Sample columns that have many non-zeros with lower probability.
On the flip side, columns that have fewer non-zeros are sampled with higher probability.
Results provably correct and independent of larger dimension, m.
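DIMSUM is exposed as columnSimilarities on RowMatrix (Scala in Spark 1.2; Python in later releases). The threshold trades accuracy for computation, per the sampling intuition above:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(sc.parallelize(
    [Vectors.dense(1, 0, 2), Vectors.dense(0, 1, 2), Vectors.dense(1, 1, 0)]))
sims = mat.columnSimilarities(0.1)       # CoordinateMatrix of cosine scores
for e in sims.entries.take(5):
    print(e.i, e.j, e.value)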
106. A General Platform
Standard libraries included with Spark:
Spark Core
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
…
107. Benefit for Users
Same engine performs data extraction, model training and interactive queries
[Diagram. Separate engines: each of parse, train, query reads from and writes back to DFS between steps; Spark: a single DFS read, then parse, train, query in memory]
108. MLlib + Streaming
As of Spark 1.1, you can train linear models in
a streaming fashion, k-means as of 1.2
Model weights are updated via SGD, thus
amenable to streaming
More work needed for decision trees
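A sketch of the streaming flavor, here streaming k-means (the Python bindings landed after the Scala originals; the socket source and port are illustrative):

from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import StreamingKMeans

ssc = StreamingContext(sc, 1)                        # 1-second mini-batches
training = ssc.socketTextStream("localhost", 9999) \
              .map(lambda l: Vectors.dense([float(x) for x in l.split(",")]))

model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(2, 1.0, seed=42)
model.trainOn(training)                              # centers updated each mini-batch
ssc.start(); ssc.awaitTermination()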
109. MLlib + SQL
df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)
DataFrames in Spark 1.3! (March 2015)
Powerful coupled with new pipeline API
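A fuller sketch of the pattern (the tweets table and its label/text columns are hypothetical; `context` is the SQLContext, as on the slide):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

df = context.sql("select label, text from tweets")
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, tf, lr])
model = pipeline.fit(df)                   # one estimator call runs the whole chain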
112. Goals for next version
Tighter integration with DataFrame and spark.ml API
Accelerated gradient methods & Optimization interface
Model export: PMML (current export exists in Spark 1.3, but
not PMML, which lacks distributed models)
Scaling: Model scaling (e.g. via Parameter Servers)
113. Research Goal: General Distributed Optimization
Distribute CVX by backing CVXPY with PySpark
Easy-to-express distributable convex programs
Need to know less math to optimize complicated objectives
114. Most active open source community in big data
200+ developers, 50+ companies contributing
Spark Community
[Bar chart: contributors in past year]
118. Hands On :
• Mood Of the Union
• RecSys 2015 Challenge
3:15
119. The Art of ELO Ranking & Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
Ref : Who is #1?, Princeton University Press
https://doubleclix.wordpress.com/2015/01/20/the-art-of-nfl-ranking-the-elo-algorithm-and-fivethirtyeight/
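The slide's "real formula" isn't reproduced in this export; for reference, the standard Elo update (which the post above builds on) is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))     # expected score of player A
R_A' = R_A + K * (S_A - E_A)               # update after actual score S_A, with K-factor K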
120. Session-5 : Mood Of the Union – Data Science on SOTU by POTUS
1. DataScience/12_SOTU-1
① Read BO
② CE-61 Template : Has BO changed since 2014 ?
2. DataScience/13_SOTU-2
① CE-61 Solution
② Read GW
③ Preprocess
④ CE-62 Template : What mood the country was in, 1790-1796 vs. 2009-2015
3. Notebook : 14_SOTU-3
① CE-62 Solution
② Homework
① GWB vs Clinton
② WJC vs AL
③ Discussions
122. Epilogue
• Interesting Exercise
• Highlights
o Map-reduce in a couple of lines !
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen, linked below)
o Set differences using subtractByKey
o Ability to sort a map by values (or any arbitrary function, for that matter)
• To explore as homework:
o TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
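A minimal sketch of the TF-IDF homework with MLlib (the SOTU file path is hypothetical):

from pyspark.mllib.feature import HashingTF, IDF

docs = sc.textFile("sotu/*.txt").map(lambda line: line.split(" "))
tf = HashingTF().transform(docs)           # term-frequency vectors via hashing
tf.cache()                                 # IDF makes two passes over the data
idf = IDF().fit(tf)
tfidf = idf.transform(tf)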
128. GraphX: Further Reading…
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
130. GraphX: Example – routing problems
What is the cost to reach node 0 from any other node in the graph? This is a
common use case for graph algorithms, e.g., Dijkstra
133. Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments for Apache Spark can be found at:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
https://databricks.com/blog/category/company/partners
http://go.databricks.com/customer-case-studies
134. Case Studies: Automatic Labs
Spark Plugs Into Your Car
Rob Ferguson
spark-summit.org/east/2015/talk/spark-plugs-into-your-car
finance.yahoo.com/news/automatic-labs-turns-databricks-cloud-140000785.html
Automatic creates personalized driving habit dashboards
• wanted to use Spark while minimizing investment in DevOps
• provides data access to non-technical analysts via SQL
• replaced Redshift and disparate ML tools with single platform
• leveraged built-in visualization capabilities in notebooks to generate dashboards easily and quickly
135. Case Studies: Twitter
Spark at Twitter: Evaluation & Lessons Learnt
Sriram Krishnan
slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter
• Spark can be more interactive, efficient than MR
• support for iterative algorithms and caching
• more generic than traditional MapReduce
• Why is Spark faster than Hadoop MapReduce?
• fewer I/O synchronization barriers
• less expensive shuffle
• the more complex the DAG, the greater the performance improvement
136. Case Studies: Pearson
Pearson uses Spark Streaming for next generation adaptive learning platform
Dibyendu Bhattacharya
databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN cluster
• single platform/common API was a key reason to replace Storm with Spark Streaming
• custom Kafka Consumer for Spark Streaming, using Low Level Kafka Consumer APIs
• handles: Kafka node failures, receiver failures, leader changes, committed offset in ZK, tunable data rate throughput
137. Case Studies: Concur
Unlocking Your Hadoop Data with Apache Spark and CDH5
Denny Lee
slideshare.net/Concur/unlocking-your-hadoop-data-with-apache-spark-and-cdh5
• leading provider of spend management solutions and services
• delivers recommendations based on business users’ travel and expenses – “to help deliver the perfect trip”
• use of traditional BI tools with Spark SQL allowed analysts to make sense of the data without becoming programmers
• needed the ability to transition quickly between Machine Learning (MLlib), Graph (GraphX), and SQL usage
• needed to deliver recommendations in real-time
138. Case Studies: Stratio
Stratio Streaming: a new approach to Spark Streaming
David Morales, Oscar Mendez
spark-summit.org/2014/talk/stratio-streaming-a-new-approach-to-spark-streaming
• Stratio Streaming is the union of a real-time messaging bus with a complex event processing engine atop Spark Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine such as auditing and statistics
139. Case Studies: Spotify
Collaborative Filtering with Spark
Chris Johnson
slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
• collab filter (ALS) for music recommendation
• Hadoop suffers from I/O overhead
• show a progression of code rewrites, converting a Hadoop-based app into efficient use of Spark
140. Case Studies: Guavus
Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World’s Largest Telcos
Eric Carr
databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• 4 of 5 top mobile network operators, 3 of 5 top Internet backbone providers, 80% MSOs in NorAm
• analyzing 50% of US mobile data traffic, +2.5 PB/day
• latency is critical for resolving operational issues before they cascade: 2.5 MM transactions per second
• “analyze first” not “store first ask questions later”
141. Case Studies: Radius Intelligence
From Hadoop to Spark in 4 months, Lessons Learned
Alexis Roos
http://youtu.be/o3-lokUFqvA
• building a full SMB index took 12+ hours using Hadoop and Cascading
• pipeline was difficult to modify/enhance
• Spark increased pipeline performance 10x
• interactive shell and notebooks enabled data scientists to experiment and develop code faster
• PMs and business development staff can use SQL to query large data sets