This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format, to reduce joins and better represent semi-structured data (a sketch follows this list).
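For illustration, a single email as a denormalized JSON document might look like this (all field names and values here are hypothetical); nesting the sender and recipients inside the record avoids the joins a normalized relational schema would require:

```json
{
  "message_id": "<1234@example.com>",
  "from": {"name": "Alice", "address": "alice@example.com"},
  "to": [{"name": "Bob", "address": "bob@example.com"}],
  "date": "2001-05-14T16:39:00Z",
  "subject": "Quarterly numbers",
  "body": "Bob, see the attached figures..."
}
```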
This document discusses building agile analytics applications. It recommends taking an iterative approach where data is explored interactively from the start to discover insights. Rather than designing insights upfront, the goal is to build an application that facilitates exploration of the data to uncover insights. This is done by setting up an environment where insights can be repeatedly produced and shared with the team. The focus is on using simple, flexible tools that work from small local data to large datasets.
Agile Data Science: Hadoop Analytics Applications (Russell Jurney)
This document provides instructions and examples for analyzing and visualizing event data in an agile manner. It discusses loading event data stored in Avro format using tools like Pig and displaying the data in a browser. Specific steps outlined include using `cat` to view Avro data, loading the data into Pig, and using ILLUSTRATE to view sample records. The overall approach emphasized is to work with atomic event data iteratively, using Pig and other Hadoop tools to explore and visualize the data.
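A sketch of those steps in Pig, assuming the piggybank AvroStorage loader and an illustrative jar location and data path:

```pig
-- Register piggybank (which ships AvroStorage) plus its Avro dependencies
REGISTER /usr/lib/pig/piggybank.jar

-- Load Avro-serialized events; the schema travels with the data
enron_emails = LOAD '/enron/emails.avro'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ILLUSTRATE shows a small, representative sample of records
ILLUSTRATE enron_emails;
```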
Agile Data Science 2.0 covers the theory and practice of applying agile methods to data science, the applied research practice of analytics. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility by building applications that describe the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to take applications from conception through development, deployment, and iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but one example meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk covers the full lifecycle of big data application development and shows how to apply lessons from agile software engineering to build better analytics applications with this stack. The system starts with plumbing, moves on to data tables, charts and search, through interactive reports, and builds towards predictions in both batch and realtime (defining the role for each), the deployment of predictive systems, and how to iteratively improve predictions that prove valuable.
The document describes a dataset containing on-time performance records for 95% of commercial flights in the United States. It includes over 30 fields of information for each flight such as airline, departure/arrival times, delays, distances, and causes of delays. An example record from the dataset is shown containing values for many of the fields.
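The example record itself isn't reproduced here, but a typical on-time performance record has roughly this shape (field names and values below are illustrative, based on the public BTS on-time performance schema):

```json
{
  "FlightDate": "2015-01-01",
  "Carrier": "AA",
  "FlightNum": "1519",
  "Origin": "DFW",
  "Dest": "MEM",
  "DepTime": "1342",
  "DepDelay": -3.0,
  "ArrTime": "1452",
  "ArrDelay": -8.0,
  "Distance": 432.0,
  "CarrierDelay": null,
  "WeatherDelay": null
}
```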
This document discusses building full stack data analytics applications using Apache Kafka and Apache Spark. It provides an overview of agile data science principles and methodologies. It also outlines various tools that can be used in the data pipeline and stack, such as Apache Spark, Apache Kafka, MongoDB, Elasticsearch, and d3.js. It discusses considerations for data structure and access patterns, as well as climbing the data value pyramid from raw data to higher order insights.
Social Network Analysis in Your Problem Domain (Russell Jurney)
This document discusses various types of networks that can be analyzed using social network analysis techniques. It provides examples of networks including founder networks, website behavior networks, online social networks, and email inbox networks. It also summarizes tools and methods for social network analysis including centrality measures, clustering, block models, cores, and dispersion analysis.
Networks All Around Us: Extracting networks from your problem domain (Russell Jurney)
This document summarizes a presentation on analyzing networks in problem domains. It provides examples of different types of networks that can be analyzed, including founder networks, website behavior networks, and online social networks. It also describes various tools and techniques for social network analysis, such as calculating centrality, clustering, and dispersion. The presentation emphasizes how to identify relevant entities and relationships to model a problem domain as a property graph and analyze it using graph databases and network analysis libraries.
Networks All Around Us: Extracting networks from your problem domain (Russell Jurney)
Network analytics are increasingly used to create machine intelligence that automates the world around us. But what is a network, and how do you analyze one? More directly: how do I find and analyze networks in my dataset? This talk goes over a number of practical examples to give viewers a playbook for doing applied social network analysis and network analytics.
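As a small illustration of the measures these summaries mention (centrality, clustering), here is a sketch with NetworkX; the edge list is made up:

```python
import networkx as nx

# A toy email network: an edge means "has emailed"
G = nx.Graph([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
])

# Centrality: who sits in the middle of the network?
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))

# Clustering: how tightly knit is each node's neighborhood?
print(nx.clustering(G))
```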
Running Intelligent Applications inside a Database: Deep Learning with Python... (Miguel González-Fierro)
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must pull data out of the database to execute a routine, and when the data is large this movement is inefficient. Stored procedures tried to solve this issue in the past by allowing simple functions to be computed inside the database, but only simple routines can be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft's Cognitive Toolkit, also known as CNTK. We used transfer learning between the ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a Convolutional Neural Network pretrained on ImageNet is applied to the lung cancer dataset to generate features. Once the features are computed, a boosted tree is used to predict whether the patient has cancer.
This entire process is computed inside the database, so data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1h, in contrast to a CPU, which would take up to 32h. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
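The CNTK and SQL Server specifics aren't shown in the summary; the general pattern (pretrained CNN as featurizer, boosted tree as classifier) looks roughly like this Python/scikit-learn sketch, where `pretrained_cnn_features` is a stand-in for the ImageNet network and the data is toy data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def pretrained_cnn_features(images):
    """Placeholder for a pretrained ImageNet CNN truncated at its
    penultimate layer; in the talk this role is played by CNTK."""
    return images.reshape(len(images), -1)  # stand-in featurizer

# images: array of lung scans; labels: 1 = sick, 0 = healthy (toy data)
images = np.random.rand(200, 32, 32)
labels = np.random.randint(0, 2, size=200)

X = pretrained_cnn_features(images)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

# Boosted trees trained on top of the CNN features
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```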
Applied Machine Learning using H2O, Python and R Workshop (Avkash Chauhan)
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Basic knowledge of R/Python and general ML concepts is assumed.
Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to participate.
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding the linear regression model with an example (a minimal H2O sketch follows this agenda)
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
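By way of illustration (not the workshop's own code), a minimal H2O linear regression in Python; the file path and column names are placeholders:

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()  # starts (or connects to) a local H2O cluster

frame = h2o.import_file("flights.csv")  # placeholder dataset
train, valid = frame.split_frame(ratios=[0.8], seed=42)

# family="gaussian" gives ordinary linear regression
model = H2OGeneralizedLinearEstimator(family="gaussian")
model.train(x=["Distance", "DepDelay"], y="ArrDelay",
            training_frame=train, validation_frame=valid)

print(model.rmse(valid=True))
```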
Big Data Analytics - Best of the Worst: Anti-patterns & Antidotes (Krishna Sankar)
This document discusses best practices for big data analytics. It emphasizes the importance of data curation to ensure semantic consistency and quality across diverse data sources. It warns against simply accumulating large amounts of ungoverned data ("data swamps") without relevant analytics or business applications. Instead, it advocates taking a full stack approach by building incremental decision models and data products to demonstrate value from the beginning. The document also stresses the need for data management layers, appropriate computing frameworks, and real-time and batch analytics capabilities to enable flexible exploration and insights.
This document discusses various heuristics and principles for architecture design. It provides guidelines for creating simplified, evolvable systems using small modular components. Some key points discussed include using open architectures, building in options, and designing structures that are resilient to stress. The document also advocates for pattern-oriented, minimalist designs and evolutionary systems that can adapt over time without disrupting existing information. Overall, the document presents best practices for handling complexity, enabling flexibility, and ensuring architectures can withstand failures.
Data science with Windows Azure - A Brief Introduction (Adnan Masood)
Data Science with Windows Azure is an introduction to the HDInsight and Hadoop offerings on Microsoft's machine learning and big data cloud platform. This was presented at the Microsoft Data Science Group – Tampa Analytics Professionals.
Data Science with Spark - Training at SparkSummit (East) (Krishna Sankar)
Slide set from the training we gave at Spark Summit East.
Blog: https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big... (Big Data Spain)
The document discusses using a graph database to query graph data held in a Hadoop data lake more efficiently. It describes the limitations of the typical approach of using Spark/GraphFrames on HDFS for graph queries; a graph database allows faster ad hoc queries by leveraging graph traversals. The document proposes a multi-model database that combines a document store, graph database, and key-value store with a common query language, and suggests running it on a DC/OS cluster for easy deployment and resource management. Examples show importing data into ArangoDB and running graph queries.
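To give a flavor of such an ad hoc query, here is an AQL traversal issued through the python-arango driver; the connection details, graph name, and vertices are assumptions:

```python
from arango import ArangoClient  # python-arango driver

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="")

# Walk 1-2 edges outward from a start vertex along the named graph
cursor = db.aql.execute(
    """
    FOR v, e IN 1..2 OUTBOUND 'persons/alice' GRAPH 'social'
      RETURN DISTINCT v.name
    """
)
print(list(cursor))
```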
Reproducible, Open Data Science in the Life Sciences (Eamonn Maguire)
The document outlines the workflow of a data scientist, from planning experiments and collecting data, to analyzing, visualizing, and publishing results. It emphasizes that data science involves formalizing hypotheses based on observations and testing them using collected data. A suite of open-source tools is presented to help data scientists in managing data and supporting open, reproducible life science research. The goal is to enable integration and sharing of experimental data and results.
Seeing at the Speed of Thought: Empowering Others Through Data Exploration (Greg Goltsov)
This document appears to be a slide deck presentation on empowering others through data exploration. The presentation discusses removing barriers to data, making feedback fast, and removing yourself from blocking others. It emphasizes visualizing data pipelines and augmenting data warehouses with data lakes to handle varying data volumes, varieties, and velocities. The goal is to turn data into insights that create business value.
So your boss says you need to learn data science (Susan Ibach)
Interested in data science but struggling to get a handle on all the confusing terminology? Not sure where to start? This presentation breaks down the concepts and the terminology.
Self Evolving Model to Attain to State of Dynamic System Accuracy (DataWorks Summit)
This document discusses the potential for self-evolving machine learning models to provide dynamic system accuracy. It notes drawbacks of static models and outlines characteristics of self-evolving models, including their ability to work at scale, sense their environment, judge relevance, discover connections, retain and build upon previous learning, and learn from experience. The document argues that self-evolving models powered by Hadoop streaming and machine learning could fulfill expectations of low latency, high throughput, real-time performance, scalability, accuracy and dynamic context while overcoming human limitations of focus, variants, emotion and reliability.
What is a distributed data science pipeline? How, with Apache Spark and friends (Andy Petrella)
What a data product was before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, like Spark Notebook, Spark, Scala, Mesos and so on, in an accompanying framework.
Eclipse science group presentation given at Eclipse Converge and Devoxx 2017 in California. These slides give an overview of projects in the Eclipse Science working group in 2017.
With the surge in Big Data, organizations have begun to implement Big Data related technologies as part of their systems. This has led to a huge need to update existing skillsets with Hadoop. Java professionals are among those who have to update themselves with Hadoop skills.
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
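A sketch of the histogram technique from that course; the path and column name are placeholders:

```python
from pyspark.sql import SparkSession
import pylab

spark = SparkSession.builder.appName("eda").getOrCreate()
df = spark.read.parquet("on_time_performance.parquet")  # placeholder path

# Compute histogram buckets on the cluster, then plot them locally with pylab
values = (df.select("ArrDelay").rdd
            .map(lambda row: row[0])
            .filter(lambda v: v is not None))
buckets, counts = values.histogram(20)

pylab.bar(buckets[:-1], counts, width=buckets[1] - buckets[0])
pylab.title("Arrival delay distribution")
pylab.show()
```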
Enabling Multimodel Graphs with Apache TinkerPop (Jason Plurad)
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve. Presented at Graph Day Texas, January 14, 2017. http://graphday.com/graph-day-at-data-day-texas/#plurad
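By way of illustration, a Gremlin traversal from Python against a TinkerPop-enabled server; the endpoint and graph contents are assumptions:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect to any TinkerPop-enabled server; swapping the backend
# graph database does not change the traversal below
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who does alice know?
names = g.V().has("person", "name", "alice").out("knows").values("name").toList()
print(names)
conn.close()
```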
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `SpaCy` for NLP analysis of texts, including full parsing, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
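In the package's current spaCy-pipeline form (which postdates these 2017 slides, so details may differ from the API described there), usage looks like:

```python
import spacy
import pytextrank  # registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Compatibility of systems of linear constraints over the set of "
          "natural numbers has been studied extensively...")

# Top-ranked keyphrases, highest TextRank score first
for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.text)
```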
Blistering fast access to Hadoop with SQL (Simon Harris)
Big SQL, Impala, and Hive were benchmarked on their ability to execute 99 queries from the TPC-DS benchmark at various scale factors. Big SQL was able to express all queries without rewriting, complete the full workload at 10TB and 30TB, and achieved the highest throughput. Impala and Hive required rewriting some queries and could only complete 70-73% of the workload at 10TB. The results indicate that query support, scale, and throughput are important factors to consider for SQL-on-Hadoop implementations.
The document describes the basic components of wind generators. It explains that wind energy derives from solar energy and the differential heating of the air by the sun. It also mentions that there are different types of wind turbines according to their power and number of blades. It then lists the main components, such as the rotor, the blades, the low-speed shaft, the gearbox, the yaw system, and the support structure.
1) JSON-LD has seen widespread adoption with over 2 million HTML pages including it and it being a required format for Linked Data platforms.
2) A primary goal of JSON-LD was to allow JSON developers to use it similarly to JSON while also providing mechanisms to reshape JSON documents into a deterministic structure for processing.
3) JSON-LD 1.1 includes additional features like using objects to index into collections, scoped contexts, and framing capabilities.
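A minimal sketch of the core idea, assuming schema.org vocabulary: a document that reads as plain JSON but also carries linked-data semantics via `@context`:

```json
{
  "@context": {
    "name": "http://schema.org/name",
    "knows": {"@id": "http://schema.org/knows", "@type": "@id"}
  },
  "@id": "http://example.org/people/alice",
  "name": "Alice",
  "knows": "http://example.org/people/bob"
}
```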
The document describes the key characteristics of an effective leader. A leader must be able to communicate clearly, possess the emotional intelligence to manage their own feelings and those of others, and set goals and objectives consistent with the group's capabilities. In addition, a leader plans strategically, leverages their strengths while working to improve their weaknesses, and helps their people grow by delegating responsibilities.
Consumers are interested in autonomous cars but still fear letting go of the wheel completely. While traffic is a major issue for city satisfaction, autonomous vehicles may help by freeing up drivers and improving the commute experience. Those most interested in autonomous cars tend to be professionals with children who already use cars to commute. Allowing cars to be shared more easily through technologies like digital keys could change whether people own cars or use them as a service. A variety of companies from traditional automakers to technology firms and public transport providers are seen as potential future providers of autonomous mobility options.
Zipcar is a car sharing service that allows users to rent vehicles by the hour or day. Members pay an annual fee of $70 plus hourly rates of $8.50 or daily rates of $59. Zipcar has over 1 million members across 500 cities in 9 countries, with a fleet of 10,000 vehicles. The document outlines Zipcar's approach, history, competitors, and future outlook, which includes increasing the fleet size and adding more hybrid and electric vehicles.
This document provides word of the day definitions for the words "clever", "dainty", "pounce", and "generous" across four sections. Each section defines the word, provides part of speech, examples, and discussion questions related to demonstrating or applying that word. The overall document aims to build vocabulary and comprehension through engaging examples and questions about the different words.
An introduction to a blog about destinations, arts, culture, people, cuisines... everything you would want to know about Kerala.
Discover Life. Feel Divinity. Find Yourself... Experience God's Own Country.
The letter provides information about Michell Figueroa, a student at Universidad Fermín Toro in Barquisimeto. Figueroa is enrolled in the Faculty of Legal and Political Sciences, School of Law, section Saia D. The letter includes his full name and identification number.
An overview of Teraproc cluster-as-a-service offerings for high-performance distributed analytics. This overview presentation includes a step-by-step demonstration of the process of deploying a ready-to-run R Studio cluster environment on Amazon Web Services. More information available at http://teraproc.com
The document discusses creating an HTML page from a template. It breaks the template down into sections like header, main content, and footer. It then provides the HTML code to recreate each section, with explanations. For example, it shows how to code the header section with elements for quick links, logo, search bar, and navigation. It also demonstrates how to code the main content with different article sections. The document is intended to teach how to reconstruct a web page design in HTML.
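A minimal sketch of the header section described above; the class names and links are invented, not taken from the original template:

```html
<header class="site-header">
  <ul class="quick-links">
    <li><a href="/contact">Contact</a></li>
    <li><a href="/login">Log in</a></li>
  </ul>
  <a class="logo" href="/"><img src="logo.png" alt="Site logo"></a>
  <form class="search" action="/search">
    <input type="search" name="q" placeholder="Search...">
  </form>
  <nav class="main-nav">
    <a href="/news">News</a>
    <a href="/articles">Articles</a>
  </nav>
</header>
```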
Top Insights from SaaStr by Leading Enterprise Software Experts (OpenView)
I had the pleasure of attending the SaaStr Annual 2016 Conference in San Francisco earlier this month and wanted to share some of the insights I gathered from that event with you here. The findings below are arranged by functional area with attribution. I tried to compress the content as much as possible, but there was A TON of great information at the conference, so I would highly recommend spending the time to read through.
This document summarizes Rachel Andrew's presentation on CSS Grid Layout. Some key points:
- CSS Grid Layout provides a new two-dimensional layout system for CSS that solves many of the problems of previous methods like floats and flexbox.
- Grid uses line-based placement, with grid lines that can be explicit or implicit, to position items on the page. Properties like grid-column and grid-row position items within the grid.
- The grid template establishes the structure of rows and columns. Items can span multiple tracks. Fraction units like fr distribute space proportionally.
- Common layouts like Holy Grail are easily achieved with Grid. The structure can also adapt at breakpoints by redefining the grid template inside media queries (see the sketch below).
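A small sketch of the line-based placement and fr units mentioned above, assuming invented class names:

```css
.wrapper {
  display: grid;
  grid-template-columns: 1fr 3fr 1fr; /* fr distributes free space */
  grid-template-rows: auto 1fr auto;
  grid-gap: 10px;
}

/* Holy Grail: header and footer span all three columns */
.header, .footer { grid-column: 1 / 4; }

/* Main content sits between the two sidebars */
.content { grid-column: 2 / 3; grid-row: 2; }
```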
Detailed report of IBM's 30TB Hadoop-DS report showing that IBM InfoSphere BigInsights (SQL-on-Hadoop) is able to execute all 99 TPC-DS queries at scale over native Hadoop data formats. Written by Simon Harris, Abhayan Sundararajan, John Poelman and Matthew Emmerton.
Work motivation refers to companies' ability to keep their employees positively engaged and performing well at work. There are four types of motivation: extrinsic, intrinsic, transitive, and transcendent. Motivation matters to companies because it improves employees' individual and group productivity. Motivating factors include having responsibilities, autonomy, and clear objectives, while interpersonal problems, lack of trust, and excessive control are demotivating.
Agile Data Science by Russell Jurney, The Hive, January 29, 2014 (The Hive)
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
Agile Data: Building Hadoop Analytics Applications (DataWorks Summit)
This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
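A sketch of steps 2 and 3 in Pig, using the mongo-hadoop connector; the jar paths and database name are assumptions:

```pig
-- Jars for Avro loading and the mongo-hadoop Pig adapter (paths illustrative)
REGISTER /usr/lib/pig/piggybank.jar
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig.jar

-- Step 2: load the serialized events into Pig
emails = LOAD '/enron/emails.avro'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Step 3: publish them to a MongoDB collection for the web tier to read
STORE emails INTO 'mongodb://localhost/agile_data.emails'
  USING com.mongodb.hadoop.pig.MongoStorage();
```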
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... (Ilkay Altintas, Ph.D.)
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
Jeremy Engle's slides from the Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data (Cloudera, Inc.)
This document discusses how Cloudera Enterprise Data Hub (EDH) can be used for advanced analytics. EDH allows users to perform diverse concurrent analytics on large datasets without moving the data. It includes tools for machine learning, graph analytics, search, and statistical analysis. EDH protects data through security features and system change tracking. The document argues that EDH is the only platform that can support all these analytics capabilities in a single, integrated system. It provides several examples of how advanced analytics on EDH have helped organizations like the government address important problems.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ... (Big Data Spain)
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvements, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA (Tomas Cervenka)
Tomáš Červenka will discuss Hive, an open-source data warehousing system built on Hadoop that provides SQL-like queries over large datasets. He will explain what Hive is useful for (big data analytics and processing), and not useful for (real-time queries and algorithms difficult to parallelize). He will demonstrate how to get started with Hive using Amazon EMR and provide a sample query, and discuss how VisualDNA uses Hive for analytics, reporting pipelines, and machine learning inference. Tips provided include using fast instance types, compression, and partitioning data.
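In the spirit of the partitioning tip, a small HiveQL example; the table and columns are made up, not from the talk:

```sql
-- Partitioning by date keeps queries from scanning the whole dataset
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- Only the partitions for July 2012 are read here
SELECT dt, COUNT(*) AS views
FROM page_views
WHERE dt BETWEEN '2012-07-01' AND '2012-07-31'
GROUP BY dt;
```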
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures like distributed search engines and custom graph engines to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Dapper: the microORM that will change your life (Davide Mauri)
ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer or you are a DBA working with developers you have heard all this terms at least once in your life…and usually in the middle of a strong discussion, debating about one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVP are supported too!) In this session I'll show how to use Dapper in your projects from the very basis to some more complex usages that will help you to create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ... (Lucidworks)
This document summarizes a presentation about query-time nonparametric regression and time routed aliases in Solr. It discusses how nonparametric multiplicative regression was used to continuously predict user interests for an online career coaching system based on click-through data. It also describes how time routed aliases in Solr provide a built-in way to implement time-partitioned indexing of timestamped data across multiple collections while automatically adding and removing collections over time.
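For reference, a time routed alias is created through the Solr Collections API along these lines; this is a sketch and the parameter values are illustrative:

```bash
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS\
&name=timedata&router.name=time&router.field=timestamp_dt\
&router.start=NOW/DAY&router.interval=%2B1DAY\
&create-collection.numShards=2"
```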
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Why do they call it Linked Data when they want to say...? (Oscar Corcho)
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing with outsiders about the goodness of Linked Data but also when reviewing papers for the COLD workshop series, I find myself, in many occasions, going back again to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to get into an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data (Cloudera, Inc.)
The document discusses how traditional analytics processes involve siloed data and platforms, long timelines for data discovery, and difficulties accessing and sharing data. It proposes that an Enterprise Data Hub (EDH) using Cloudera can help address these issues by providing unified storage for all types of data, shorter analytics lifecycles, and the ability to do more with data by using 100x more data and more types of data. The EDH allows organizations to use all of their data and gain insights sooner.
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that makes Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen... (OpenSource Connections)
The document discusses building a lightweight discovery interface for Chinese patents using Solr/Lucene. It describes parsing various patent file formats using Tika and building custom parsers. It also emphasizes the importance of making the search solution accessible by allowing users to export data and share results.
This presentation was given at one of the DSATL Meetups in March 2018, in partnership with the Southern Data Science Conference 2018 (www.southerndatascience.com).
From a student to an Apache committer: practice of Apache IoTDB (jixuan1989)
This talk is given by Xiangdong Huang, who is a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source projects and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
This document provides an introduction and agenda for a presentation on Spark. It discusses how Spark is a fast engine for large-scale data processing and how it improves on MapReduce. Spark stores data in memory across clusters to allow for faster iterative computations versus writing to disk with MapReduce. The presentation will demonstrate Spark concepts through word count and log analysis examples and provide an overview of Spark's Resilient Distributed Datasets (RDDs) and directed acyclic graph (DAG) execution model.
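The canonical word count example looks roughly like this in PySpark; the input path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("input.txt")                  # placeholder path
            .flatMap(lambda line: line.split())     # split lines into words
            .map(lambda word: (word, 1))            # pair each word with 1
            .reduceByKey(lambda a, b: a + b))       # sum counts per word

# Each transformation above extends the DAG; nothing runs until an action:
print(counts.take(10))
```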
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
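As one example of those rich time semantics, a sketch of QuestDB's SAMPLE BY aggregation; the table and columns are invented, and `ts` is assumed to be the designated timestamp:

```sql
-- Average price per symbol, per minute, over the most recent hour of trades
SELECT ts, symbol, avg(price) AS avg_price
FROM trades
WHERE ts > dateadd('h', -1, now())
SAMPLE BY 1m;
```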
The Ipsos - AI - Monitor 2024 Report (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
2. 2
About Me…
Bearding.
• Bearding is my #1 natural talent.
• I’m going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
3. 3
Agile Data Science: The Book
A philosophy.
Not the only way,
but it’s a really good way!
Code: ‘AUTHD’ – 50% off
4. 4
We Go Fast, But Don’t Worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
7. 7
Scientific Computing / HPC
Tubes and Mercury (Old School) vs. Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We’re back!
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
9. 9
Data Center as Computer
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’
Warehouse Scale Computers and Applications
10. 10
Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
12. 12
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
13. 13
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
14. 14
How To Get Insight Into Product
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x the effort of app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on research vs developer timeline?
15. 15
The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”
16. 16
The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”
17. 17
The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”
18. 18
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
19. 19
The Wrong Way - Conclusion
Inevitable Conclusion (image: a plane flying into a mountain)
21. 21
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover it by exploring.
22. 22
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
23. 23
Data Design
• It’s not the 1st query that yields insight; it’s the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at intersection of data & design
26. 26
Set Up An Environment Where:
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
28. 28
Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
29. 29
Value Document > Relation
Note: Hive/ArrayQL/NewSQL’s support of documents/array types blurs this distinction.
30. 30
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure (see the sketch below)
• Column compressed document formats beat JOINs!
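To make this concrete, here is a minimal sketch in Python of the kind of denormalized document it implies; the values are made up, and the field names follow the Enron ETL later in this deck:
# One denormalized email document: recipients are embedded in the record,
# so rendering a message needs no JOIN at query time.
email = {
    "message_id": "<123.ABC@example.com>",
    "date": "2001-05-15T10:31:00",
    "from": {"address": "kate@example.com", "name": "Kate"},
    "subject": "Re: pipeline capacity",
    "body": "...",
    "tos": [{"address": "russell@example.com", "name": "Russell"}],
    "ccs": [],
    "bccs": []
}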
31. 31
Value Imperative > Declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, so optimize your own process (sketch below)
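A hedged sketch of the imperative style in Python (the file and fields are hypothetical; the point is checking each step before taking the next):
import json

# Step 1: load raw records and eyeball a sample before going further.
records = [json.loads(line) for line in open("emails.json")]
print(records[:3])  # the local equivalent of Pig's ILLUSTRATE

# Step 2: data is dirty - drop records missing required fields, count the damage.
cleaned = [r for r in records if r.get("from") and r.get("date")]
print(len(records) - len(cleaned), "records dropped")

# Step 3: normalize one field at a time, checking as we go.
for r in cleaned:
    r["from"]["address"] = r["from"]["address"].lower().strip()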
33. 33
Ex. Dataflow: ETL + Email Sent Count
(I can’t read this either. Get a big version here.)
34. 34
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
35. 35
Localhost vs Petabyte Scale: Same Tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS at 3x replication
• Everything we serve in our app is re-creatable via Hadoop.
38. 38
0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard. (See the sketch below.)
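For instance, a minimal sketch of schema-onboard serialization using the fastavro library (my choice for illustration, not the deck’s; any Avro library works the same way):
from fastavro import writer, parse_schema

# The schema is written into the .avro file itself - "the schema is onboard."
schema = parse_schema({
    "type": "record",
    "name": "Email",
    "fields": [
        {"name": "message_id", "type": "string"},
        {"name": "date", "type": "string"},
        {"name": "subject", "type": ["null", "string"]}
    ]
})

records = [{"message_id": "<123@example.com>", "date": "2001-05-15T10:31:00", "subject": "hi"}]

with open("emails.avro", "wb") as out:
    writer(out, schema, records)
Any reader can now open emails.avro without being handed the schema separately.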
39. 39
0.1) Documents Via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
  message_id:chararray,
  sql_date:chararray,
  from_address:chararray,
  from_name:chararray,
  subject:chararray,
  body:chararray
);
enron_recipients = load '/enron/enron_recipients.tsv' as (
  message_id:chararray,
  reciptype:chararray,
  address:chararray,
  name:chararray
);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate
  enron_messages::message_id as message_id,
  CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
  TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
  enron_messages::subject as subject,
  enron_messages::body as body,
  headers::tos.(address, name) as tos,
  headers::ccs.(address, name) as ccs,
  headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage();
Example here.
40. 40
0.2) Serialize Events From Streams
import imaplib

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()  # close any previous connection
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)  # append one event to the Avro file
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)
Scrape your own gmail in Python and Ruby.
41. 41
0.3) ETL Logs
log_data = LOAD 'access_log'
USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
AS (remoteAddr,
remoteLogname,
user,
time,
method,
uri,
proto,
bytes);
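For intuition, here is a hedged local-mode equivalent in Python: a regex (my own, good enough for well-formed Common Log Format lines) that extracts the same fields the Pig loader does, plus the HTTP status:
import re

# Apache Common Log Format fields, mirroring the Pig schema above.
CLF = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(CLF.match(line).groupdict())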
42. 42
1) Plumb Atomic Events->Browser
(Example stack that enables high productivity)
46. 46
1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, lets have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
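Once the store finishes, a quick sanity check from Python with pymongo (my addition; the database and collection names follow the mongourl above) might look like:
from pymongo import MongoClient

# Did the records land? Count them and spot-check one document.
db = MongoClient("mongodb://localhost")["enron"]
print(db.emails.count_documents({}))
print(db.emails.find_one())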
51. 51
What’s the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
52. 52
1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        {% for key in data['keys'] %}
          <th>{{ key }}</th>
        {% endfor %}
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
            <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>
Complete example here with code here.
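The Flask side that feeds this template might look like the following sketch (the route, template filename and Mongo lookup are assumptions for illustration, not the book’s exact code):
from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost")["enron"]

@app.route("/email/<path:message_id>")
def email(message_id):
    record = db.emails.find_one({"message_id": message_id})
    # The template above iterates data['keys'] and data['values']
    data = {"keys": list(record.keys()), "values": list(record.values())}
    return render_template("table.html", data=data)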
55. 56
1.8) List Links to Sorted Events
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date:-1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
57. 58
1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
60. 61
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61. 62
2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
62. 63
2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
  COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
63. 64
Data Processing in Our Stack
A new feature in our application might begin at any layer…
GREAT!
Any team member can add new features, no problemo!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
64. 65
Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
67. 68
3.0) From Charts to Reports
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page (sketched below)
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
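One way the entity-page pattern might look in Flask, sketched under assumptions (route, template and field names are hypothetical; 'related' links the person to their top correspondents):
from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost")["enron"]

@app.route("/person/<address>")
def person(address):
    # One URL per unique entity: this person's sent mail...
    sent = list(db.emails.find({"from.address": address}).limit(1000))
    # ...plus links to the most related entities of the same type.
    counts = {}
    for e in sent:
        for to in e.get("tos", []):
            counts[to["address"]] = counts.get(to["address"], 0) + 1
    related = sorted(counts, key=counts.get, reverse=True)[:10]
    return render_template("person.html", address=address, emails=sent, related=related)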
70. 71
3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
72. 73
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
73. 74
4.2) Think in Different Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
84. 85
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
  http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
  http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
93. 94
Why Doesn’t Kate Reply to My Emails?
• What time is best to catch her? (sketched below)
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
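The first question is answerable straight from the email documents. A hedged sketch in Python, assuming each sent email carries an ISO date and a precomputed got_reply flag from a prior thread join (both assumptions):
from collections import Counter
from datetime import datetime

def reply_rate_by_hour(sent_emails):
    """Crude P(reply | hour sent): which send hours actually get replies?"""
    sent, replied = Counter(), Counter()
    for e in sent_emails:
        hour = datetime.fromisoformat(e["date"]).hour
        sent[hour] += 1
        if e.get("got_reply"):  # assumed flag, marked by a prior thread join
            replied[hour] += 1
    return {h: replied[h] / sent[h] for h in sent}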