This document provides instructions and examples for analyzing and visualizing event data in an agile manner. It discusses loading event data stored in Avro format using tools like Pig and displaying the data in a browser. Specific steps outlined include using Cat to view Avro data, loading the data into Pig and using Illustrate to view sample records. The overall approach emphasized is to work with atomic event data in an iterative way using Pig and other Hadoop tools to explore and visualize the data.
This document discusses building agile analytics applications. It recommends taking an iterative approach where data is explored interactively from the start to discover insights. Rather than designing insights upfront, the goal is to build an application that facilitates exploration of the data to uncover insights. This is done by setting up an environment where insights can be repeatedly produced and shared with the team. The focus is on using simple, flexible tools that work from small local data to large datasets.
Agile Data Science: Building Hadoop Analytics Applications | Russell Jurney
This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format to reduce joins and better represent semi-structured
Agile Data Science 2.0 covers the theory and practice of applying agile methods to the practice of applied analytics research called data science. The book takes the stance that data products are the preferred output format for data science teams to effect change in an organization. Accordingly, we show how to "get meta" to enable agility in building applications describing the applied research process itself. Then we show how to use 'big data' tools to iteratively build, deploy and refine analytics applications. Tracking data-product development through the five stages of the "data value pyramid", we show you how to build applications from conception through development through deployment and then through iterative improvement. Application development is a fundamental skill for a data scientist, and by publishing your data science work as a web application, we show you how to effect maximal change within your organization.
Technologies covered include Python, Apache Spark (Spark MLlib, Spark Streaming), Apache Kafka, MongoDB, ElasticSearch and Apache Airflow.
The document describes a dataset containing on-time performance records for 95% of commercial flights in the United States. It includes over 30 fields of information for each flight such as airline, departure/arrival times, delays, distances, and causes of delays. An example record from the dataset is shown containing values for many of the fields.
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Incubating Airflow, MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science using this full-stack to build better analytics applications. The entire lifecycle of big data application development is discussed. The system starts with plumbing, moving on to data tables, charts and search, through interactive reports, and building towards predictions in both batch and realtime (and defining the role for both), the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
This document discusses building full stack data analytics applications using Apache Kafka and Apache Spark. It provides an overview of agile data science principles and methodologies. It also outlines various tools that can be used in the data pipeline and stack, such as Apache Spark, Apache Kafka, MongoDB, Elasticsearch, and d3.js. It discusses considerations for data structure and access patterns, as well as climbing the data value pyramid from raw data to higher order insights.
Social Network Analysis in Your Problem Domain | Russell Jurney
This document discusses various types of networks that can be analyzed using social network analysis techniques. It provides examples of networks including founder networks, website behavior networks, online social networks, and email inbox networks. It also summarizes tools and methods for social network analysis including centrality measures, clustering, block models, cores, and dispersion analysis.
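As a rough illustration of the centrality measures mentioned above, here is a minimal sketch of degree centrality in plain Python. The edge list and the names in it are hypothetical and are not taken from the presentation. A real analysis would more likely use a network analysis library.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical email-inbox network: an edge means two people exchanged mail.
edges = [("ana", "bo"), ("ana", "cy"), ("ana", "di"), ("bo", "cy")]
scores = degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

In this toy network, `most_central` is the person connected to everyone else. Other measures named above (clustering, cores, dispersion) need more than neighbor counts and are not covered by this sketch.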
Networks All Around Us: Extracting networks from your problem domain | Russell Jurney
This document summarizes a presentation on analyzing networks in problem domains. It provides examples of different types of networks that can be analyzed, including founder networks, website behavior networks, and online social networks. It also describes various tools and techniques for social network analysis, such as calculating centrality, clustering, and dispersion. The presentation emphasizes how to identify relevant entities and relationships to model a problem domain as a property graph and analyze it using graph databases and network analysis libraries.
Running Intelligent Applications inside a Database: Deep Learning with Python... | Miguel González-Fierro
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must fetch the data from the database to execute a routine. If the data is large, this data movement is inefficient. Stored procedures tried to solve this issue in the past by allowing simple functions to be computed inside the database. However, only simple routines can be executed that way.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft’s Cognitive Toolkit, also known as CNTK. We used transfer learning between ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a pretrained Convolutional Neural Network on ImageNet is used on the lung cancer dataset to generate features. Once the features are computed, a boosted tree is applied to predict whether the patient has cancer or not.
The whole process is computed inside the database, so data movement is minimized. We are even able to execute the algorithm on the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1h, whereas a CPU would take up to 32h. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction for a patient.
Networks All Around Us: Extracting networks from your problem domain | Russell Jurney
Network analytics are being increasingly utilized to create machine intelligence that automates the world around us. But what is a network, and how do you analyze them? More directly: how do I find and analyze networks in my dataset? This talk will go over a number of examples of practical network analytics to give viewers a playbook for doing applied social network analysis and network analytics.
The Amino Analytical Framework - Leveraging Accumulo to the Fullest | Donald Miner
Speaker: Steve Touw, CTO, 42six Solutions a CSC Company
Amino is an open source analytical framework that focuses on a “building-blocks” approach to data discovery by pre-computing features about data at the most granular level possible and then allows analysts and data scientists to easily combine those features into more complex questions.
The magic behind Amino is found in its custom Accumulo index, which strives to provide fast scans, highly dimensional scans, data compression, and a simple query structure. The index leverages Accumulo iterators to do much of the scan-time logic, which places no limit on the dimensionality of the query. Iterators are what make Accumulo unique and enable the Amino index to execute complex queries.
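The "building-blocks" idea described above, where features are pre-computed per record and then combined into more complex questions, can be sketched in plain Python. This is not Amino's API and does not use Accumulo. The function names, feature names, and records are all hypothetical. It only shows that combining pre-computed features amounts to intersecting sets of record ids.

```python
from collections import defaultdict


def build_feature_index(records, extractors):
    """Precompute (feature_name, value) -> set of record ids."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for name, extract in extractors.items():
            index[(name, extract(record))].add(record_id)
    return index


def query(index, *conditions):
    """Return the ids that match every (feature_name, value) condition."""
    if not conditions:
        return set()
    matches = [index.get(condition, set()) for condition in conditions]
    return set.intersection(*matches)


# Hypothetical records and feature extractors, for illustration only.
records = {
    1: {"country": "US", "bytes": 120},
    2: {"country": "US", "bytes": 9000},
    3: {"country": "DE", "bytes": 8000},
}
extractors = {
    "country": lambda r: r["country"],
    "size": lambda r: "large" if r["bytes"] >= 1000 else "small",
}
index = build_feature_index(records, extractors)
large_us = query(index, ("country", "US"), ("size", "large"))
```

Adding another dimension to a question is just one more condition, which loosely mirrors the claim above that the query has no limit on dimensionality.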
This document discusses various heuristics and principles for architecture design. It provides guidelines for creating simplified, evolvable systems using small modular components. Some key points discussed include using open architectures, building in options, and designing structures that are resilient to stress. The document also advocates for pattern-oriented, minimalist designs and evolutionary systems that can adapt over time without disrupting existing information. Overall, the document presents best practices for handling complexity, enabling flexibility, and ensuring architectures can withstand failures.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
So your boss says you need to learn data science | Susan Ibach
Interested in data science but confused by all the terms? Not sure where to start? This presentation breaks down the concepts and the terminology.
WorDS of Data Science in the Presence of Heterogenous Computing Architectures | Ilkay Altintas, Ph.D.
ISUM 2015 Keynote
Summary: Computational and data science is about extracting knowledge from data and modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose, and programmability. Publications and the provenance of the data products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications and discusses why focusing on these concepts is important for executability and reproducibility in computational and data science.
This document discusses bridging big data and data science using scalable workflows. It describes how scientific workflows can integrate various data science tools and processes to analyze large datasets. Workflows allow standardized, programmable, and reproducible analysis at scale. Examples are provided of workflows developed at the San Diego Supercomputer Center for applications in bioinformatics, wildfire management, and other domains. The document advocates conceptualizing computational analyses as workflows to facilitate collaboration between data scientists and developers.
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big... | Big Data Spain
The document discusses using a graph database to store and query graph data stored in a Hadoop data lake more efficiently. It describes the limitations of the typical approach of using Spark/GraphFrames on HDFS for graph queries. A graph database allows for faster ad hoc graph queries by leveraging graph traversals. The document proposes using a multi-model database that combines a document store, graph database, and key-value store with a common query language. It suggests this approach could run on a DC/OS cluster for easy deployment and management of resources. Examples show importing data into ArangoDB and running graph queries.
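To make "ad hoc graph queries by leveraging graph traversals" concrete, here is a minimal sketch of a depth-limited breadth-first traversal over an in-memory adjacency list. It is plain Python and not ArangoDB's query language. The graph and the function name are hypothetical.

```python
from collections import deque


def neighbors_within(graph, start, max_depth):
    """Breadth-first traversal returning {node: depth} up to max_depth hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in depths:
                depths[nxt] = depths[node] + 1
                queue.append(nxt)
    return depths


# Hypothetical directed graph stored as an adjacency list.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
reachable = neighbors_within(graph, "a", 2)
```

A traversal like this only touches vertices near the start vertex, which is the kind of access pattern a graph database is built to serve quickly.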
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... | Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate, and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose, and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
Kepler is a scientific workflow system that facilitates the end-to-end computational scientific process of building, sharing, running, and learning from workflows. It provides features like an experiment-oriented workflow notebook, efficient data movement, multi-scale workflow building, and provenance tracking for reproducibility. Kepler builds upon the open-source Ptolemy II framework and is a cross-project collaboration involving multiple contributors and projects. Typical Kepler workflows consist of actors that perform tasks connected by data flow, and workflow parameters can be specified by users. Kepler enables programmable scalability for applications involving large-scale data access, computational analysis, and reuse across various scientific domains.
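The idea of actors connected by data flow, with user-specified workflow parameters, can be sketched in a few lines of Python. This is not Kepler's API (Kepler builds on the Java-based Ptolemy II framework). The actor functions and parameter names are hypothetical. The sketch handles only a linear chain, whereas real workflows can branch and merge.

```python
def run_workflow(actors, data, params):
    """Pass data through each actor in order; every actor sees the shared params."""
    for actor in actors:
        data = actor(data, params)
    return data


def read_values(_, params):
    # Source actor: ignores its input and emits the configured values.
    return list(params["values"])


def scale(values, params):
    # Transformation actor: multiplies each value by a workflow parameter.
    return [v * params["factor"] for v in values]


def total(values, params):
    # Sink actor: reduces the stream to a single result.
    return sum(values)


result = run_workflow(
    [read_values, scale, total],
    None,
    {"values": [1, 2, 3], "factor": 10},
)
```

Changing `factor` or `values` reruns the same workflow with different parameters, without touching the actors themselves.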
Data science with Windows Azure - A Brief Introduction | Adnan Masood
Data Science with Windows Azure is an introduction to the HDInsight and Hadoop offerings on Microsoft's cloud-based machine learning and big data platform. It was presented at the Microsoft Data Science Group – Tampa Analytics Professionals.
Seeing at the Speed of Thought: Empowering Others Through Data Exploration | Greg Goltsov
This document appears to be a slide deck presentation on empowering others through data exploration. The presentation discusses removing barriers to data, making feedback fast, and removing yourself from blocking others. It emphasizes visualizing data pipelines and augmenting data warehouses with data lakes to handle varying data volumes, varieties, and velocities. The goal is to turn data into insights that create business value.
Eclipse science group presentation given at Eclipse Converge and Devoxx 2017 in California. These slides give an overview of projects in the Eclipse Science working group in 2017.
Reproducible, Open Data Science in the Life Sciences | Eamonn Maguire
The document outlines the workflow of a data scientist, from planning experiments and collecting data, to analyzing, visualizing, and publishing results. It emphasizes that data science involves formalizing hypotheses based on observations and testing them using collected data. A suite of open-source tools is presented to help data scientists in managing data and supporting open, reproducible life science research. The goal is to enable integration and sharing of experimental data and results.
An introduction to a blog about destinations, arts, culture, people, and cuisines: everything you would want to know about Kerala.
Discover Life. Feel Divinity. Find Yourself... Experience God's Own Country
1) JSON-LD has seen widespread adoption: over 2 million HTML pages include it, and it is a required format for Linked Data platforms.
2) A primary goal of JSON-LD was to allow JSON developers to use it similarly to JSON while also providing mechanisms to reshape JSON documents into a deterministic structure for processing.
3) JSON-LD 1.1 includes additional features like using objects to index into collections, scoped contexts, and framing capabilities.
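The points above describe JSON-LD as ordinary JSON plus a way to reshape documents into a deterministic structure. The sketch below shows a small JSON-LD document and a deliberately simplified term expansion in Python. The function `expand_terms` is hypothetical and is not the JSON-LD expansion algorithm: it handles only a flat `@context` that maps top-level keys to IRI strings. The example identifiers under `example.org` are made up.

```python
import json

# A small JSON-LD document: plain JSON plus an @context that maps
# short keys to IRIs.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": "http://schema.org/url",
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "homepage": "http://example.org/alice",
}


def expand_terms(document):
    """Replace top-level keys with IRIs from a flat, string-only @context."""
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue  # the context is consumed, not copied
        expanded[context.get(key, key)] = value
    return expanded


expanded = expand_terms(doc)
print(json.dumps(expanded, indent=2))
```

A JSON developer can read `doc["name"]` as usual, while a Linked Data consumer works with the expanded keys. A conforming processor also handles nested and scoped contexts, typed values, and framing, none of which this sketch attempts.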
Enabling Multimodel Graphs with Apache TinkerPop | Jason Plurad
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve. Presented at Graph Day Texas, January 14, 2017. http://graphday.com/graph-day-at-data-day-texas/#plurad
Social Network Analysis in Your Problem DomainRussell Jurney
This document discusses various types of networks that can be analyzed using social network analysis techniques. It provides examples of networks including founder networks, website behavior networks, online social networks, and email inbox networks. It also summarizes tools and methods for social network analysis including centrality measures, clustering, block models, cores, and dispersion analysis.
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
This document summarizes a presentation on analyzing networks in problem domains. It provides examples of different types of networks that can be analyzed, including founder networks, website behavior networks, and online social networks. It also describes various tools and techniques for social network analysis, such as calculating centrality, clustering, and dispersion. The presentation emphasizes how to identify relevant entities and relationships to model a problem domain as a property graph and analyze it using graph databases and network analysis libraries.
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
In this talk we present a new paradigm of computation where the intelligence is computed inside the database. Standard software systems must get the data from the database to execute a routine. If the size of the data is big, there are inefficiencies due to the data movement. Store procedures tried to solve this issue in the past, allowing for computing simple functions inside the database. However, only simple routines can be executed.
To showcase the capabilities of our new system, we created a lung cancer detection algorithm using Microsoft’s Cognitive Toolkit, also known as CNTK. We used transfer learning between ImageNet dataset, which contains natural images, and a lung cancer dataset, which contains scans of horizontal sections of the lung for healthy and sick patients. Specifically, a pretrained Convolutional Neural Network on ImageNet is used on the lung cancer dataset to generate features. Once the features are computed, a boosted tree is applied to predict whether the patient has cancer or not.
All this process is computed inside the database, so the data movement is minimized. We are even able to execute the algorithm using the GPU of the virtual machine that hosts the database. Using a GPU, we can compute the featurization in less than 1h, in contrast to using a CPU, that would take up to 32h. Finally, we set up an API to connect the solution to a web app, where a doctor can analyze the images and get a prediction of a patient.
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
Network analytics are being increasingly utilized to create machine intelligence that automates the world around us. But what is a network, and how do you analyze them? More directly: how do I find and analyze networks in my dataset? This talk will go over a number of examples of practical network analytics to give viewers a playbook for doing applied social network analysis and network analytics.
The Amino Analytical Framework - Leveraging Accumulo to the Fullest Donald Miner
Speaker: Steve Touw, CTO, 42six Solutions a CSC Company
Amino is an open source analytical framework that focuses on a “building-blocks” approach to data discovery by pre-computing features about data at the most granular level possible and then allows analysts and data scientists to easily combine those features into more complex questions.
The magic behind Amino is found in it’s custom Accumulo index; that index strives to provide fast scans, highly dimensional scans, data compression, and a simple query structure. The index leverages Accumulo iterators to do much of the scan time logic which has no limit on dimensionality of the query. Iterators are what makes Accumulo unique and enables the Amino index to execute the complex queries.
This document discusses various heuristics and principles for architecture design. It provides guidelines for creating simplified, evolvable systems using small modular components. Some key points discussed include using open architectures, building in options, and designing structures that are resilient to stress. The document also advocates for pattern-oriented, minimalist designs and evolutionary systems that can adapt over time without disrupting existing information. Overall, the document presents best practices for handling complexity, enabling flexibility, and ensuring architectures can withstand failures.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
So your boss says you need to learn data scienceSusan Ibach
Interested in Data science but trying to get a handle on all the terms getting you confused? Not sure where to start? This presentation breaks down the concepts and the terminology
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesIlkay Altintas, Ph.D.
ISUM 2015 Keynote
Summary: Computational and Data Science is about extracting knowledge from data and modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose and programmability. Publications and provenance of the data products products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications, and discuss why focusing on these concepts is important for executability and reproducibility in computational and data science.
This document discusses bridging big data and data science using scalable workflows. It describes how scientific workflows can integrate various data science tools and processes to analyze large datasets. Workflows allow standardized, programmable, and reproducible analysis at scale. Examples are provided of workflows developed at the San Diego Supercomputer Center for applications in bioinformatics, wildfire management, and other domains. The document advocates conceptualizing computational analyses as workflows to facilitate collaboration between data scientists and developers.
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
The document discusses using a graph database to store and query graph data stored in a Hadoop data lake more efficiently. It describes the limitations of the typical approach of using Spark/GraphFrames on HDFS for graph queries. A graph database allows for faster ad hoc graph queries by leveraging graph traversals. The document proposes using a multi-model database that combines a document store, graph database, and key-value store with a common query language. It suggests this approach could run on a DC/OS cluster for easy deployment and management of resources. Examples show importing data into ArangoDB and running graph queries.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
cientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publications of the results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and present a methodology for workflow-driven science based on these maturing requirements.
Kepler is a scientific workflow system that facilitates the end-to-end computational scientific process of building, sharing, running, and learning from workflows. It provides features like an experiment-oriented workflow notebook, efficient data movement, multi-scale workflow building, and provenance tracking for reproducibility. Kepler builds upon the open-source Ptolemy II framework and is a cross-project collaboration involving multiple contributors and projects. Typical Kepler workflows consist of actors that perform tasks connected by data flow, and workflow parameters can be specified by users. Kepler enables programmable scalability for applications involving large-scale data access, computational analysis, and reuse across various scientific domains.
Data science with Windows Azure - A Brief IntroductionAdnan Masood
Data Science with Windows Azure is an introduction to HDInsight and Hadoop offerings from Microsoft Machine Learning and Big Data Cloud based platform. This was presented at Microsoft Data Science Group – Tampa Analytics Professionals.
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationGreg Goltsov
This document appears to be a slide deck presentation on empowering others through data exploration. The presentation discusses removing barriers to data, making feedback fast, and removing yourself from blocking others. It emphasizes visualizing data pipelines and augmenting data warehouses with data lakes to handle varying data volumes, varieties, and velocities. The goal is to turn data into insights that create business value.
Eclipse science group presentation given at Eclipse Converge and Devoxx 2017 in California. These slides give an overview of projects in the Eclipse Science working group in 2017.
Reproducible, Open Data Science in the Life SciencesEamonn Maguire
The document outlines the workflow of a data scientist, from planning experiments and collecting data, to analyzing, visualizing, and publishing results. It emphasizes that data science involves formalizing hypotheses based on observations and testing them using collected data. A suite of open-source tools is presented to help data scientists in managing data and supporting open, reproducible life science research. The goal is to enable integration and sharing of experimental data and results.
An introduction to a blog about destinations, arts, culture, people, and cuisines: everything you would want to know about Kerala.
Discover Life. Feel Divinity. Find Yourself...........Experience God's Own Country
1) JSON-LD has seen widespread adoption with over 2 million HTML pages including it and it being a required format for Linked Data platforms.
2) A primary goal of JSON-LD was to allow JSON developers to use it similarly to JSON while also providing mechanisms to reshape JSON documents into a deterministic structure for processing.
3) JSON-LD 1.1 includes additional features like using objects to index into collections, scoped contexts, and framing capabilities.
Enabling Multimodel Graphs with Apache TinkerPopJason Plurad
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve. Presented at Graph Day Texas, January 14, 2017. http://graphday.com/graph-day-at-data-day-texas/#plurad
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than the output of simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
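To make the idea of a graph algorithm that ranks terms more concrete, here is a minimal sketch in plain Python. It is not PyTextRank's API or implementation: the function name `textrank_keywords`, the tiny stopword set, the co-occurrence window, and the normalized PageRank update are all assumptions chosen for illustration, and it ranks single words rather than full keyphrases.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from any library).
STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "to", "is", "are", "on", "from", "with"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link two distinct words when they appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Power iteration: each word shares its score equally among its neighbors.
    n = len(neighbors)
    scores = {w: 1.0 / n for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / n
            + damping * sum(scores[m] / len(neighbors[m]) for m in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = textrank_keywords(
    "graph algorithms rank graph nodes; graph ranking finds key graph nodes"
)
```

Calling `textrank_keywords` returns `(word, score)` pairs sorted from most to least central; a fuller implementation would add part-of-speech filtering and merge adjacent top-ranked words into phrases.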
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
Blistering fast access to Hadoop with SQLSimon Harris
Big SQL, Impala, and Hive were benchmarked on their ability to execute 99 queries from the TPC-DS benchmark at various scale factors. Big SQL was able to express all queries without rewriting, complete the full workload at 10TB and 30TB, and achieved the highest throughput. Impala and Hive required rewriting some queries and could only complete 70-73% of the workload at 10TB. The results indicate that query support, scale, and throughput are important factors to consider for SQL-on-Hadoop implementations.
The document describes the basic components of wind turbines. It explains that wind energy derives from solar energy and the differential heating of air by the sun. It also mentions that there are different types of wind turbines according to their power and number of blades. It then lists the main components, such as the rotor, the blades, the low-speed shaft, the gearbox, the yaw system, and the support system.
The document describes the key characteristics of an effective leader. A leader must be able to communicate clearly, have the emotional intelligence to manage their own feelings and those of others, and set goals and objectives consistent with the group's capabilities. In addition, a leader plans strategically, makes the most of their strengths and works to improve their weaknesses, and helps their people grow by delegating responsibilities.
Consumers are interested in autonomous cars but still fear letting go of the wheel completely. While traffic is a major issue for city satisfaction, autonomous vehicles may help by freeing up drivers and improving the commute experience. Those most interested in autonomous cars tend to be professionals with children who already use cars to commute. Allowing cars to be shared more easily through technologies like digital keys could change whether people own cars or use them as a service. A variety of companies from traditional automakers to technology firms and public transport providers are seen as potential future providers of autonomous mobility options.
Zipcar is a car sharing service that allows users to rent vehicles by the hour or day. Members pay an annual fee of $70 plus hourly rates of $8.50 per hour or daily rates of $59. Zipcar has over 1 million members across 500 cities in 9 countries, with a fleet of 10,000 vehicles. The document outlines Zipcar's approach, history, competitors, and future outlook which includes increasing their fleet size and adding more hybrid and electric vehicles.
An overview of Teraproc cluster-as-a-service offerings for high-performance distributed analytics. This overview presentation includes a step-by-step demonstration of the process of deploying a ready-to-run R Studio cluster environment on Amazon Web Services. More information available at http://teraproc.com
This document provides word of the day definitions for the words "clever", "dainty", "pounce", and "generous" across four sections. Each section defines the word, provides part of speech, examples, and discussion questions related to demonstrating or applying that word. The overall document aims to build vocabulary and comprehension through engaging examples and questions about the different words.
The letter provides information about Michell Figueroa, a student at Universidad Fermín Toro in Barquisimeto. Figueroa is enrolled in the Faculty of Legal and Political Sciences, School of Law, section Saia D. The letter includes the student's full name and identification number.
How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
Detailed report of IBM's 30TB Hadoop-DS report showing that IBM InfoSphere BigInsights (SQL-on-Hadoop) is able to execute all 99 TPC-DS queries at scale over native Hadoop data formats. Written by Simon Harris, Abhayan Sundararajan, John Poelman and Matthew Emmerton.
Work motivation refers to a company's ability to maintain its employees' positive drive and their performance at work. There are four types of motivation: extrinsic, intrinsic, transitive, and transcendent. Motivation matters to companies because it improves employees' individual and group productivity. Factors that motivate include having responsibilities, autonomy, and clear objectives, while interpersonal problems, lack of trust, and excessive control demotivate.
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
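The four steps above can be sketched end to end with nothing but the Python standard library. This is a toy under stated assumptions, not the deck's actual stack: JSON lines stand in for Avro, a plain list stands in for Pig relations, a dict stands in for MongoDB, and the sample events, field names, and the `/email/<id>` route are all hypothetical.

```python
import json
from html import escape

# Hypothetical raw events; the field names are illustrative only.
raw_events = [
    {"message_id": "m1", "from": "alice@example.com", "to": ["bob@example.com"], "subject": "Hello"},
    {"message_id": "m2", "from": "bob@example.com", "to": ["alice@example.com"], "subject": "Re: Hello"},
]


def serialize(events):
    """Step 1: write each event as one self-contained JSON document."""
    return [json.dumps(event, sort_keys=True) for event in events]


def load(lines):
    """Step 2: load the serialized documents back for exploration."""
    return [json.loads(line) for line in lines]


def publish(records, key="message_id"):
    """Step 3: 'publish' records to a key-value store (a dict stands in for the database)."""
    return {record[key]: record for record in records}


def render(record):
    """Step 4: render one record as an HTML fragment that links to its own page."""
    url = "/email/" + escape(record["message_id"], quote=True)
    return '<a href="{}">{}</a>'.format(url, escape(record["subject"]))


store = publish(load(serialize(raw_events)))
page = render(store["m1"])
```

Because each record stays atomic and keyed by its own identifier, every step can be rerun and refined independently, which is what makes the rapid iteration described above practical.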
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
This document discusses how Cloudera Enterprise Data Hub (EDH) can be used for advanced analytics. EDH allows users to perform diverse concurrent analytics on large datasets without moving the data. It includes tools for machine learning, graph analytics, search, and statistical analysis. EDH protects data through security features and system change tracking. The document argues that EDH is the only platform that can support all these analytics capabilities in a single, integrated system. It provides several examples of how advanced analytics on EDH have helped organizations like the government address important problems.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures like distributed search engines and custom graph engines to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human in the Loop Data Analytics!
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
Tomáš Červenka will discuss Hive, an open-source data warehousing system built on Hadoop that provides SQL-like queries over large datasets. He will explain what Hive is useful for (big data analytics and processing), and not useful for (real-time queries and algorithms difficult to parallelize). He will demonstrate how to get started with Hive using Amazon EMR and provide a sample query, and discuss how VisualDNA uses Hive for analytics, reporting pipelines, and machine learning inference. Tips provided include using fast instance types, compression, and partitioning data.
Dapper: the microORM that will change your lifeDavide Mauri
ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer or you are a DBA working with developers you have heard all these terms at least once in your life…and usually in the middle of a strong discussion, debating about one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVP are supported too!) In this session I'll show how to use Dapper in your projects from the very basics to some more complex usages that will help you to create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!
This presentation was given at one of the DSATL Meetups in March 2018 in partnership with Southern Data Science Conference 2018 (www.southerndatascience.com)
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
This document summarizes a presentation about query-time nonparametric regression and time routed aliases in Solr. It discusses how nonparametric multiplicative regression was used to continuously predict user interests for an online career coaching system based on click-through data. It also describes how time routed aliases in Solr provide a built-in way to implement time-partitioned indexing of timestamped data across multiple collections while automatically adding and removing collections over time.
Why do they call it Linked Data when they want to say...?Oscar Corcho
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing with outsiders about the goodness of Linked Data but also when reviewing papers for the COLD workshop series, I find myself, in many occasions, going back again to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to get into an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
From a student to an apache committer practice of apache io tdbjixuan1989
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
This document provides an introduction and agenda for a presentation on Spark. It discusses how Spark is a fast engine for large-scale data processing and how it improves on MapReduce. Spark stores data in memory across clusters to allow for faster iterative computations versus writing to disk with MapReduce. The presentation will demonstrate Spark concepts through word count and log analysis examples and provide an overview of Spark's Resilient Distributed Datasets (RDDs) and directed acyclic graph (DAG) execution model.
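The word count example mentioned above follows a split, pair, and reduce-by-key shape. The sketch below shows that shape in plain Python only; it does not use Spark or its API, runs on a single machine, and the function name `word_count` and the sample lines are assumptions for illustration.

```python
from functools import reduce


def word_count(lines):
    """Count words with the same three stages a distributed word count uses."""
    # Stage 1 (like a flat map): split every line into individual words.
    words = [word for line in lines for word in line.split()]

    # Stage 2 (like a map): pair each word with an initial count of 1.
    pairs = [(word, 1) for word in words]

    # Stage 3 (like a reduce by key): sum the counts for each distinct word.
    def add_pair(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts

    return reduce(add_pair, pairs, {})


counts = word_count(["to be or not to be", "that is the question"])
```

In a cluster setting the three stages run in parallel across partitions, and keeping intermediate results in memory is what the summary above credits for the speedup over disk-based MapReduce.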
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
The document discusses how traditional analytics processes involve siloed data and platforms, long timelines for data discovery, and difficulties accessing and sharing data. It proposes that an Enterprise Data Hub (EDH) using Cloudera can help address these issues by providing unified storage for all types of data, shorter analytics lifecycles, and the ability to do more with data by using 100x more data and more types of data. The EDH allows organizations to use all of their data and gain insights sooner.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might have in common the fact that both are building blocks, or dependencies of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more than that in common.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations, and training sessions related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private clients. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (hence her nickname deneb_alpha).
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
2. About Me…Bearding.
• Bearding is my #1 natural talent.
• I’m going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
3. Agile Data Science: The Book
A philosophy.
Not the only way,
but it’s a really good way!
Code: ‘AUTHD’ – 50% off
4. We Go Fast, But Don’t Worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read Now @ Safari Rough Cuts
7. Scientific Computing / HPC
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
Tubes and Mercury (Old School)
Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We’re back!
9. Data Center as Computer
Warehouse Scale Computers and Applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
Click here for a paper on operating a ‘data center as computer.’
10. Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command
• OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME!
12. Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
13. Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
14. How To Get Insight Into Product
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on research vs developer timeline?
15. The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”
16. The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”
17. The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”
18. The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
19. The Wrong Way - Conclusion
Inevitable Conclusion
Plane
Mountain
21. Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
22. -> Strategy
So make an app for exploring your data.
Iterate and publish intermediate results.
Which becomes a palette for what you ship.
23. Data Design
• Not the 1st query that = insight, it’s the 15th, or 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at intersection of data & design
26. Setup An Environment Where:
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day Zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
28. Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
29. Value Document > Relation
Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.
30. Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs!
31. Value Imperative > Declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
33. Ex. Dataflow: ETL + Email Sent Count
(I can’t read this either. Get a big version here.)
34. Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
35. Localhost vs Petabyte Scale: Same Tools
• Simplicity essential to scalability: highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3
• Everything we serve in our app is re-creatable via Hadoop.
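The "prepare a good sample" bullet is easy to act on when each record is a whole document: you can sample records without worrying about breaking joins. One way to do that, sketched here in Python, is reservoir sampling over a stream of documents. The function name `sample_documents` and the fixed seed are assumptions for illustration, not something from the talk.

```python
import random


def sample_documents(documents, k, seed=0):
    """Return k documents chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so the input can be a stream
    that is too large to hold in memory. Hypothetical helper.
    """
    rng = random.Random(seed)  # fixed seed keeps local runs repeatable
    reservoir = []
    for index, doc in enumerate(documents):
        if index < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Keep the new document with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = doc
    return reservoir
```

Because every sampled record is a complete document, the sample stays self-consistent when you iterate on it in local mode.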
38. 0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.
39. 0.1) Documents Via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray);

enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray);

/* Split recipients by type, then gather all three types per message */
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;

/* One document per email, with the recipients nested inside it */
emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage(
Example here.
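If you don't read Pig, the same relation-to-document ETL can be sketched in plain Python. Treat this as an illustration only: the function name `build_email_documents` is made up, the rows are assumed to be dicts keyed like the Pig schemas above, and the date conversion is a crude stand-in for `CustomFormatToISO`.

```python
from collections import defaultdict


def build_email_documents(messages, recipients):
    """Fold recipient rows into their message: one document per email.

    Hypothetical helper; `messages` and `recipients` are lists of dicts
    whose keys mirror the two TSV schemas.
    """
    # Like split + cogroup: bucket recipients by message and by type.
    headers = defaultdict(lambda: {"to": [], "cc": [], "bcc": []})
    for row in recipients:
        if row["reciptype"] in ("to", "cc", "bcc"):
            headers[row["message_id"]][row["reciptype"]].append(
                {"address": row["address"], "name": row["name"]}
            )

    documents = []
    for msg in messages:
        # Like the inner join: messages with no recipients are dropped.
        if msg["message_id"] not in headers:
            continue
        by_type = headers[msg["message_id"]]
        documents.append({
            "message_id": msg["message_id"],
            # Assumption: 'yyyy-MM-dd HH:mm:ss' input, no time zone handling.
            "date": msg["sql_date"].replace(" ", "T"),
            "from": {"address": msg["from_address"], "name": msg["from_name"]},
            "subject": msg["subject"],
            "body": msg["body"],
            "tos": by_type["to"],
            "ccs": by_type["cc"],
            "bccs": by_type["bcc"],
        })
    return documents
```

The payoff is the shape of the result: everything about an email lives in one record, so later jobs can skip the join.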
40. 0.2) Serialize Events From Streams
import imaplib

class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        # Close any previous connection before opening a new one
        try:
            self.imap.shutdown()
        except Exception:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        # Append one email document to the Avro file
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if self.imap and self.imap_folder:
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                # Only keep emails that parsed cleanly and carry the fields we need
                if status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash:
                    print(email_id, charset, email_hash['thread_id'])
                    self.write(email_hash)
Scrape your own gmail in Python and Ruby.
41. 0.3) ETL Logs
log_data = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
    AS (remoteAddr,
        remoteLogname,
        user,
        time,
        method,
        uri,
        proto,
        bytes);
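To try the same log ETL on localhost without Pig, a regular expression over the Apache Common Log Format gets you most of the way. This is a rough sketch: the name `parse_common_log_line` is hypothetical, and the pattern also captures the HTTP `status` field, which is part of the log format but not in the field list above.

```python
import re

# Common Log Format: host ident user [time] "method uri proto" status bytes
COMMON_LOG = re.compile(
    r'^(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_common_log_line(line):
    """Return one log event as a dict, or None if the line does not match."""
    match = COMMON_LOG.match(line.strip())
    return match.groupdict() if match else None
```

Lines that fail to parse come back as `None`, which makes the dirty ones easy to count and inspect before you clean them.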
42. 1) Plumb Atomic Events->Browser
(Example stack that enables high productivity)
46. 1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
51. What’s the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
52. 1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        <tr>
          {% for key in data['keys'] %}
          <th>{{ key }}</th>
          {% endfor %}
        </tr>
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
          <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>
Complete example here with code here.
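The template only needs a `data` object with two parallel lists, `keys` and `values`. How the original app builds that object isn't shown on the slide, so the following is a guess at the smallest thing that would work; `to_table_context` is a made-up name.

```python
def to_table_context(event):
    """Split one event document into the parallel lists the template loops over.

    Hypothetical helper. Keys are sorted so the columns come out in the
    same order on every page load.
    """
    keys = sorted(event)
    return {"keys": keys, "values": [event[key] for key in keys]}
```

Pass the result to the template as `data` and any event document, whatever its fields, renders as a one-row table.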
55. 1.8) List Links to Sorted Events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date desc;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
{
"_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
"message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
"from" : [
...
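Here is the same "list of sorted events per sender" idea as a small Python sketch, handy for checking results on a local sample. It assumes the intent is the most recent emails first, and that `date` is an ISO 8601 string, which sorts correctly as plain text. The function name is hypothetical.

```python
from collections import defaultdict


def recent_emails_per_sender(emails, limit=1000):
    """Group email documents by sender, newest first, capped at `limit`.

    Hypothetical helper; each email is a dict with a nested
    'from' -> 'address' field and an ISO 8601 'date' string.
    """
    grouped = defaultdict(list)
    for email in emails:
        grouped[email["from"]["address"]].append(email)
    return [
        {
            "from_address": address,
            "emails": sorted(msgs, key=lambda e: e["date"], reverse=True)[:limit],
        }
        for address, msgs in grouped.items()
    ]
```

Each returned record is one ready-to-serve document: a sender plus a bag of that sender's emails.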
57. 1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
60. 2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties and displaying the results is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61. 2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as json.
This would make a good Pig Macro.
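For comparison, here is a top-N-per-key sketch in Python that produces the same nested structure the browser would get as JSON. The field names `key` and `arbitrary_rank` come from the Pig snippet; the function name `top_n_per_key` is an assumption.

```python
import heapq
import json
from collections import defaultdict


def top_n_per_key(things, n=10):
    """Return, for each key, its n highest-ranked things (best first).

    Hypothetical helper; each thing is a dict with 'key' and
    'arbitrary_rank' fields.
    """
    grouped = defaultdict(list)
    for thing in things:
        grouped[thing["key"]].append(thing)
    return [
        {
            "key": key,
            "top_things": heapq.nlargest(n, group, key=lambda t: t["arbitrary_rank"]),
        }
        for key, group in grouped.items()
    ]


def to_json(records):
    """Serialize the records the way a web endpoint might return them."""
    return json.dumps(records)
```

`heapq.nlargest` avoids sorting each whole group when n is small, which is roughly what the `order` plus `limit` pair expresses in Pig.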
62. 2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month),
             COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
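The same monthly roll-up can be sketched in Python for a quick local check. One simplification to flag: the month is taken as the `YYYY-MM` prefix of an ISO 8601 `datetime` string rather than calling a real date library, and `monthly_timeseries` is a made-up name.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count things per (key, month) and return a sorted series per key.

    Hypothetical helper; each thing is a dict with 'key' and an
    ISO 8601 'datetime' string such as '2012-03-01T09:15:00'.
    """
    # Group by key and month, and count: the first Pig statement.
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)

    # Regroup by key with months in order: the second Pig statement.
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)
```

Months with no events are simply absent from the series, so a chart that needs a continuous axis has to fill those gaps itself.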
63. Data Processing in Our Stack
A new feature in our application might begin at any layer…
GREAT!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
Any team member can add new features, no problemo!
64. Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
See real example here.
Ex: Overall total emails calculated in each layer
67. 3.0) From Charts to Reports
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
70. 3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
72. 4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
73. 4.2) Think in Different Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
84. 4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
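To make TF-IDF concrete without Pig, here is a minimal Python sketch. It uses one common variant (term frequency as a share of the document, inverse document frequency as log(N / df)), which may differ from what the linked macro computes. The function names are hypothetical, and tokenizing is left to the caller.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term in every document.

    `documents` maps a document id to its list of tokens. Returns a dict
    of document id -> {term: score}. Hypothetical helper.
    """
    doc_count = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(doc_count / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(term_scores, n=5):
    """Return the n highest-scoring terms of one document as its topics."""
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]
```

Terms that appear in every document score zero, which is the point: the words left at the top are the ones that set a document apart.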
93. Why Doesn’t Kate Reply to My Emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
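The first question, what time is best to catch her, is a conditional probability: P(reply | hour sent). A rough Python sketch follows. It assumes each sent email already carries a boolean `replied` flag, which is a hypothetical field you would have to derive yourself (for example from thread data), and the function name is made up.

```python
from collections import Counter
from datetime import datetime


def reply_rate_by_hour(sent_emails):
    """Estimate P(reply | hour sent) from a list of sent emails.

    Hypothetical helper; each email is a dict with an ISO 8601 'date'
    string and a boolean 'replied' flag.
    """
    sent = Counter()
    replied = Counter()
    for email in sent_emails:
        hour = datetime.fromisoformat(email["date"]).hour
        sent[hour] += 1
        if email["replied"]:
            replied[hour] += 1
    # Only hours with at least one sent email appear, so no division by zero.
    return {hour: replied[hour] / sent[hour] for hour in sorted(sent)}
```

With only a handful of emails per hour these ratios are noisy, so treat them as something to chart and eyeball before anyone calls it a prediction.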
94. Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts
    GENERATE
        com.packetloop.packetpig.udf.geoip.Country(src) as country,
        priority;
countries = GROUP countries BY country;
countries = FOREACH countries
    GENERATE
        group,
        AVG(countries.priority) as average_severity;
STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.