The document discusses Pattern, an open source project for migrating predictive models from SAS, R, etc. onto Hadoop. Pattern works on top of Cascading to support scoring of PMML models at scale on Hadoop. Models are reused and deployed within Cascading workflows. The document provides an example of exporting a random forest model trained in R to PMML, and then using Pattern to score sample data on Hadoop using that PMML model. It also discusses Cascading and PMML in more detail.
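The export step the deck demonstrates uses R's pmml package on a trained random forest; as a language-neutral sketch of what such an export produces, the snippet below serializes a single hypothetical decision stump to a minimal PMML-style document using only the Python standard library. The field name, threshold, and class labels are invented for illustration, and a real export would of course come from the trained model itself.

```python
import xml.etree.ElementTree as ET

def stump_to_pmml(field, threshold, left_label, right_label):
    """Serialize a one-split decision stump to a minimal PMML-style document.

    This mirrors the idea behind PMML export: the trained model is written
    out as declarative XML that any PMML-aware engine (such as Pattern on
    Hadoop) can score without the original training environment.
    """
    pmml = ET.Element("PMML", version="4.3")
    dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="1")
    ET.SubElement(dd, "DataField", name=field,
                  optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "TreeModel", modelName="stump",
                          functionName="classification")
    root = ET.SubElement(model, "Node")
    ET.SubElement(root, "True")
    left = ET.SubElement(root, "Node", score=left_label)
    ET.SubElement(left, "SimplePredicate", field=field,
                  operator="lessThan", value=str(threshold))
    right = ET.SubElement(root, "Node", score=right_label)
    ET.SubElement(right, "SimplePredicate", field=field,
                  operator="greaterOrEqual", value=str(threshold))
    return ET.tostring(pmml, encoding="unicode")

doc = stump_to_pmml("sepal_length", 5.5, "setosa", "versicolor")
```

A random forest in PMML is essentially many such TreeModel segments wrapped in a MiningModel; the single-stump version above only shows the shape of the artifact that crosses the R-to-Hadoop boundary.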
Drilling Cyber Security Data With Apache Drill - Charles Givre
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
Introduction to Apache Drill - NYC Apache Drill Meetup - Vince Gonzalez
This document discusses Apache Drill, an open source SQL query engine for exploring large datasets. It provides examples of using Drill to query JSON data stored in various data sources like HDFS, HBase and Hive in SQL format. Drill allows for schema-free queries on structured and unstructured data with low latency. Use cases mentioned include a new restaurant in Las Vegas querying Yelp data to find top reviewers for an opening party.
Data Exploration with Apache Drill: Day 1 - Charles Givre
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course material for this presentation is available at https://github.com/cgivre/data-exploration-with-apache-drill
Data Exploration with Apache Drill: Day 2 - Charles Givre
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course material for this presentation is available at https://github.com/cgivre/data-exploration-with-apache-drill
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading - Paco Nathan
Presentation to the Boulder/Denver BigData meetup 2013-09-25 http://www.meetup.com/Boulder-Denver-Big-Data/events/131047972/
Overview of Enterprise Data Workflows with Cascading; code samples in Cascading, Cascalog, Scalding; Lingual and Pattern Examples; An Evolution of Cluster Computing based on Apache Mesos, with use cases
Study after study shows that data scientists spend 50-90 percent of their time gathering and preparing data. In many large organizations this problem is exacerbated by data being stored across a variety of systems with different structures and architectures. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in place using standard ANSI SQL, without having to define complex schemata or rebuild their entire data infrastructure. In this talk I will introduce the audience to Apache Drill, including some hands-on exercises, and present a case study of how Drill can be used to query a variety of data sources. The presentation will cover:
* How to explore and merge data sets in different formats
* Using Drill to interact with other platforms such as Python
* Exploring data stored on different machines
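The hands-on exercises above presumably go through Drill's SQL shell; programmatic access works the same way, for example over Drill's REST API. As a hedged sketch, the snippet below only builds the request that would POST a query to a Drill endpoint; the localhost URL and the Yelp file path in the query are assumptions, and no request is actually sent.

```python
import json
from urllib.request import Request

DRILL_URL = "http://localhost:8047/query.json"  # assumed local Drill instance

def drill_query_request(sql):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return Request(DRILL_URL, data=payload,
                   headers={"Content-Type": "application/json"})

# A schema-free query over raw JSON files in place: no table definition needed.
req = drill_query_request(
    "SELECT t.user_id, t.review_count "
    "FROM dfs.`/data/yelp/user.json` t "
    "ORDER BY t.review_count DESC LIMIT 10"
)
```

Sending the request with `urllib.request.urlopen(req)` would return the query results as JSON, which is also how tools like Superset talk to Drill.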
The document provides an overview of big data analytics and Hadoop. It defines big data and the challenges of working with large, complex datasets. It then discusses Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. Key components of Hadoop include HDFS for storage, MapReduce for parallel processing, and other tools like Pig, Hive, HBase etc. The document provides examples of how Hadoop is used by many large companies and describes the architecture and basic functions of HDFS and MapReduce.
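As a minimal illustration of the MapReduce model that overview describes, the following Hadoop-streaming-style word count keeps the map, shuffle, and reduce phases explicit. On a real cluster each phase would be handed to Hadoop across many machines; here all three are simulated in-process.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the counts for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle phase: Hadoop sorts map output by key; sorted() simulates that.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(w, (c for _, c in grp))
                for w, grp in groupby(pairs, key=itemgetter(0)))

counts = word_count(["the quick brown fox", "the lazy dog"])
```

The same mapper and reducer, written as scripts reading stdin and writing stdout, could run unchanged under Hadoop Streaming, which is the usual first step before moving to Pig or Hive for anything more complex.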
This document discusses the landscape of big data query processing and summarizes different approaches. It introduces N1QL, a SQL-like query language for NoSQL databases, and compares it to other query languages. The document also discusses ongoing research into unifying query languages and systems to allow both operational and analytical query processing over heterogeneous and nested data models using a single interface.
Kaaahwa Armstrong completed a field attachment internship at Uganda Broadcasting Corporation (UBC) in their IT department. During the internship, they gained experience in networking, web development, and computer maintenance. Specifically, they learned how to set up local and wide area networks, configure routers, install and configure MySQL databases, and perform basic hardware repairs. The internship provided valuable hands-on experience in key IT skills and improved Kaaahwa's technical abilities.
There is one consistent message we hear from customers across industries and around the world: "We would like to reduce our reliance on SAS." In this webinar, we review the top reasons customers cite for moving from SAS to R; the benefits of open source analytics; the challenges of switching; and the tools you will need to build your own roadmap. We review the key differences between SAS and R from the user's perspective, and provide you with the tools to move forward.
This document discusses scaling machine learning models from a laboratory setting to production. It proposes using a standardized representation called PMML to capture models produced by R and Scikit-Learn. PMML allows models to be deployed across different frameworks and languages. The document outlines APIs for evaluating, maintaining, and integrating models as reusable functions within data pipelines in Hadoop ecosystems like Spark, Pig, and Cascading. The goal is a portable, platform-agnostic architecture for operationalizing machine learning based on open standards.
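To make the "models as reusable functions" idea concrete, here is a stdlib-only sketch that evaluates a one-split PMML TreeModel. The document string is a hand-written toy, not output from R or Scikit-Learn, and a production deployment would use a real PMML engine (such as the evaluators in Pattern or JPMML); the point is only that scoring needs nothing but the XML artifact.

```python
import xml.etree.ElementTree as ET

# Toy PMML document: a single split on a made-up "hours" field.
TOY_PMML = """<PMML version="4.3">
  <TreeModel modelName="stump" functionName="classification">
    <Node>
      <True/>
      <Node score="yes">
        <SimplePredicate field="hours" operator="lessThan" value="40"/>
      </Node>
      <Node score="no">
        <SimplePredicate field="hours" operator="greaterOrEqual" value="40"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

def score(pmml_doc, record):
    """Walk a one-level PMML TreeModel and return the matching node's score."""
    root = ET.fromstring(pmml_doc).find(".//TreeModel/Node")
    for node in root.findall("Node"):
        pred = node.find("SimplePredicate")
        value = float(record[pred.get("field")])
        threshold = float(pred.get("value"))
        op = pred.get("operator")
        if (op == "lessThan" and value < threshold) or \
           (op == "greaterOrEqual" and value >= threshold):
            return node.get("score")
    return None

label = score(TOY_PMML, {"hours": 35})
```

Because `score` is a pure function of (model document, record), it can be mapped over records in any of the pipeline frameworks mentioned above without retraining or re-implementing the model.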
The document discusses how the media product uses and develops conventions of real R&B pop music products. It analyzes the conventions used in the music video, such as set design, lighting, camera work, costumes, and editing. It also compares the album artwork and website to real artists like Beyoncé, Rihanna, and Mary J. Blige. While some conventions are followed, like close-ups and social media promotion, others are challenged, like changing photos on the website and opening the music video with the rapper instead of the female artist. Overall, the goal was to attract audiences while establishing the new artist within the R&B pop genre.
The document consists of a single page that continuously repeats the address of a blog. It provides no information about the blog's content.
We need to know the basics of responsive design, as well as the benefits and drawbacks, in order to be active participants in conversations about software trends and innovation. We'll walk through what exactly responsive design is, show examples of its implementation, and talk about next steps and what this means for your individual role.
The document explains the Unified Modeling Language (UML), a tool used with object-oriented programming languages. UML is used to create design models such as use case, activity, sequence, and class diagrams to support software development. Design models are needed to give programmers a blueprint for building an application or piece of software.
This document discusses fashion and customization inspiration from Pim Kramer's studio Unicps. It covers various crafting techniques like stick and stitch, silhouettes and forms, mixing and matching, printing and painting, 3D effects and draping. The presentation was created for educational purposes and contains visual materials from workshops and the fashion design studio found on the web.
This document provides a brief overview of fashion history from 1898-1970s including magazines, designers, artists, materials, and styles that served as inspiration for fashion illustrations. Key figures and works mentioned include Lanvin couture in the 1930s-50s, Wiener Werkstatte in 1907, and artists like Klimt, Bakst, Bassman, and Renie. The document was created by Pim Kramer in 2008 to showcase influences for fashion illustration.
Valgen Case Study - Midsized Apparel Manufacturer - ValgenMobility
The document discusses a case study of an apparel manufacturing company that wants to increase production volumes without increasing capacity by measuring process lead times and monitoring productivity. It provides sample data on production processes, lead times, queue delays, and stitch times achieved versus required times. The data is being collected directly from the source using a mobile phone to track key metrics without needing additional software.
MAS Leather Int. is a manufacturer and exporter of leather products located in Sialkot, Pakistan. The company was founded in 1989 and has grown to become one of the largest producers of leather safety products in Pakistan. MAS Leather produces a variety of gloves, jackets, aprons, and other leather accessories for welding, construction, and other industries. The company has an extensive manufacturing process that includes tanneries, cutting machinery, stitching facilities, and packaging and shipping capabilities. MAS Leather prides itself on providing high quality leather products at competitive prices along with excellent customer service.
STICH II trial for supratentorial intracerebral bleed - garry07
This document summarizes the STICH II trial, which compared early surgery versus initial conservative treatment for spontaneous supratentorial lobar intracerebral haemorrhages of 10-100 mL. The trial randomized 601 patients to either early surgery within 12 hours or initial conservative treatment with delayed surgery if needed. The primary outcome of death or severe disability at 6 months showed no significant difference between the groups. Secondary outcomes, including mortality and functional scales, also showed no significant differences, indicating that early surgery did not improve outcomes over initial conservative treatment for these types of haemorrhages.
Critical appraisal of Stitch Trial by Dr. Akshay Mehta - cardiositeindia
The STICH trial tested two hypotheses regarding the treatment of ischemic heart failure:
1) Adding CABG to medical therapy improves long-term survival more than medical therapy alone.
2) For patients with anterior left ventricular dysfunction, surgical ventricular reconstruction plus CABG and medical therapy improves survival free of cardiac hospitalization more than CABG and medical therapy without ventricular reconstruction. The trial randomized over 1200 patients to test these hypotheses but did not find conclusive evidence to support either the primary or secondary hypotheses.
The document is a purchase order from COS for 10,000 white cotton canvas men's shirts from designer Md. Sajjadul Karim Bhuiyan. It provides the style number, date, fabric composition, size range, and detailed stitching and hemming specifications that the designer must follow to develop the shirts for the spring/summer season.
Lesson 8 for grade 6 - The concept of an operating system and its purpose; the graphical in... - VsimPPT
The download is available at http://vsimppt.com.ua/
-------
Lesson 8 for grade 6 - The concept of an operating system and its purpose; the graphical interface of an operating system
Soybean is an important cash crop in Southern Africa
Demand is driven by the growing poultry industry
Productivity of soybean is <1 t/ha due to low adoption of improved varieties and agronomic practices
Low adoption is due to limited availability and affordability of seed of improved varieties
The document summarizes the transportation problem and various methods to solve it. The transportation problem aims to find the transportation schedule that minimizes total transportation cost. The document describes the North West Corner Method, the Least Cost Method, and Vogel's Approximation Method, and provides the steps for Vogel's Approximation Method, which include checking for a basic feasible solution and revising solutions with a loop method if positive check numbers exist.
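As a sketch of the simplest of those methods, the North West Corner rule below allocates greedily from the top-left cell of the cost table. The supply and demand figures are invented for illustration, and a real solver would then improve this basic feasible solution (for example with the loop method the summary mentions).

```python
def north_west_corner(supply, demand):
    """Return a basic feasible allocation matrix for a balanced
    transportation problem, filling cells from the top-left corner."""
    supply, demand = list(supply), list(demand)   # work on copies
    alloc = [[0] * len(demand) for _ in supply]
    i = j = 0
    while i < len(supply) and j < len(demand):
        qty = min(supply[i], demand[j])   # ship as much as both sides allow
        alloc[i][j] = qty
        supply[i] -= qty
        demand[j] -= qty
        if supply[i] == 0:                # row exhausted: move down
            i += 1
        else:                             # column exhausted: move right
            j += 1
    return alloc

# Three sources with supplies 20/30/25, three sinks demanding 10/25/40.
plan = north_west_corner([20, 30, 25], [10, 25, 40])
```

The rule ignores costs entirely, which is why it only provides a starting point; the Least Cost and Vogel methods trade a little extra bookkeeping for a cheaper initial solution.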
4. STUDY ON VARIATION OF JOINT FORCES IN STEEL TRUSS BRIDGE - AELC
This document provides an overview of a student's thesis on analyzing the variation of joint forces in steel truss bridges. The objectives are to understand steel truss bridge components and design, perform influence line analysis using STAAD-Pro software, and study joint force variations. The scope will involve designing a simple span parallel chord Warren truss bridge superstructure to AASHTO standards with HS20-24 live loading. Implementation will include modeling the bridge in STAAD-Pro and analyzing joints. The document also covers characteristics, advantages, disadvantages and components of steel truss bridges.
The fertilizer industry in Pakistan is important for agriculture as soils are deficient in nutrients. There are 10 fertilizer plants, with urea being the main nitrogenous fertilizer produced. Demand for fertilizers is increasing due to population growth. However, the industry faces issues with reduced gas supply, high production costs for DAP due to imported raw materials, and uneducated farmers leading to unbalanced fertilizer use. Solutions include exploring new gas reserves, incentives for local DAP production, and educating farmers on balanced fertilizer application.
Reducing Development Time for Production-Grade Hadoop Applications - Cascading
Ryan Desmond's presentation at the Cascading Meetup on August 27, 2015. A brief overview of Cascading to give a basic understanding to Clojure users who might use PigPen & Clojure to access Cascading.
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford - Sri Ambati
Ken and Fonda will talk through how organizations are embracing open source machine learning and AI platforms and what strategies to use to make the transformation easier.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi - Felicia Haggarty
The document discusses challenges with building operational data applications on Hadoop and introduces the Cask Data Application Platform (CDAP) as a solution. It provides an agenda that covers data applications, challenges, CDAP motivation and goals, use cases, and an introduction and architecture overview of CDAP. The document aims to demonstrate how CDAP provides a unified platform that simplifies application development and lifecycle while supporting reusable data and processing patterns.
Sparkling Water Webinar October 29th, 2014 - Sri Ambati
Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience.
H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O's NanoFast™ Scoring Engine. Learn more by going to http://www.h2o.ai and contact us for more information.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Join our experts Neeraja Rentachintala, Sr. Director of Product Management and Aman Sinha, Lead Software Engineer and host Sameer Nori in a discussion about putting Apache Drill into production.
Pre-Con Ed: There has to be a Better Way to Fast Test Coverage! - CA Technologies
The document discusses how CA Test Data Manager and CA Agile Requirements Designer can work together to help organizations improve test coverage without as much hassle. It provides an overview of the key capabilities of each product and demonstrates in a use case how they can be integrated in a DevOps workflow. Specifically, it shows how test data can be synthetically generated in CA Test Data Manager, a use case can be modeled with paths and requirements in CA Agile Requirements Designer, and how that model can be used to automate testing of the data generation.
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark - Databricks
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
We are living in a world of abundant data, so-called “big data”. The term “big data” is closely associated with unstructured data. Such data are called “unstructured” or NoSQL data because they do not fit neatly in a traditional row-column relational database. A NoSQL (Not only SQL, or non-relational SQL) database is a type of database that can handle unstructured data. For example, a NoSQL database can store unstructured data such as XML (Extensible Markup Language), JSON (JavaScript Object Notation) or RDF (Resource Description Framework) files.
If an enterprise is able to extract unstructured data from NoSQL databases and transfer it to the SAS environment for analysis, this will produce tremendous value, especially from a big data solutions standpoint. This paper will show how unstructured data is stored in NoSQL databases and ways to transfer it to the SAS environment for analysis. First, the paper will introduce the NoSQL database; for example, NoSQL databases can store unstructured data such as XML, JSON or RDF files. Second, the paper will show how the SAS system connects to NoSQL databases using a REST (Representational State Transfer) API (Application Programming Interface); for example, SAS programmers can use PROC HTTP to extract XML or JSON files from the NoSQL database through the REST API. Finally, the paper will show how SAS programmers can convert XML and JSON files to SAS datasets for analysis; for example, SAS programmers can create XMLMap files with the XMLV2 LIBNAME engine and convert the extracted XML files to SAS datasets.
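The REST-to-dataset workflow the paper describes (fetch JSON over HTTP, then flatten it into rows) can be sketched outside SAS as well. A minimal Python illustration using only the standard library; the payload and field names below are made up, standing in for a real REST response:

```python
import json

# Hypothetical REST payload of the kind PROC HTTP would retrieve;
# in a real pipeline this string would come from an HTTP GET.
payload = '''
{
  "patients": [
    {"id": 1, "name": "Ada",   "visits": {"count": 3}},
    {"id": 2, "name": "Grace", "visits": {"count": 5}}
  ]
}
'''

def flatten(records):
    """Flatten nested JSON objects into flat row dicts (one level deep)."""
    rows = []
    for rec in records:
        row = {}
        for key, value in rec.items():
            if isinstance(value, dict):
                for subkey, subvalue in value.items():
                    row[f"{key}_{subkey}"] = subvalue
            else:
                row[key] = value
        rows.append(row)
    return rows

rows = flatten(json.loads(payload)["patients"])
print(rows[0])  # {'id': 1, 'name': 'Ada', 'visits_count': 3}
```

The flattening step mirrors what an XMLMap does for XML: nested structure becomes tabular columns suitable for analysis.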
My talk at the Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discuss an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python – Christian Perone
This document provides an introduction to Apache Spark and collaborative filtering. It discusses big data and the limitations of MapReduce, then introduces Apache Spark including Resilient Distributed Datasets (RDDs), transformations, actions, and DataFrames. It also covers Spark Machine Learning (ML) libraries and algorithms such as classification, regression, clustering, and collaborative filtering.
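The RDD concepts the deck covers — lazy transformations versus eager actions — can be illustrated in plain Python with generators. This is only an analogy, not Spark code: generators defer work the way `map`/`filter` transformations do, and consuming them plays the role of an action like `collect`:

```python
# Plain-Python analogy for Spark's lazy transformations vs eager actions.
lines = ["spark makes big data simple", "big data is big"]

# "Transformations": nothing is computed yet, just a pipeline description.
words = (w for line in lines for w in line.split())
big_words = (w for w in words if len(w) > 3)

# "Action": materializing the generator finally triggers the computation.
result = list(big_words)
print(result)
```

In real Spark the same shape appears as `sc.parallelize(lines).flatMap(str.split).filter(lambda w: len(w) > 3).collect()`, with the pipeline additionally distributed across a cluster.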
This document discusses Tableau's role in big data architectures and its integration with Hadoop. It outlines different workload categories for business intelligence and their considerations for Tableau. Three integration models are described: isolated exploration, live interactive query, and integrated advanced analytics. Capability models are presented for each integration approach regarding suitability for Hadoop. Finally, architecture patterns are shown for isolated exploration, live interactive querying, and an integrated advanced analytics platform using Tableau and Hadoop.
Vital AI MetaQL: Queries Across NoSQL, SQL, SPARQL, and Spark – Vital.AI
This document provides an overview of MetaQL, which allows composing queries across NoSQL, SQL, SPARQL, and Spark databases using a domain model. Key points include:
- MetaQL uses a domain model to define concepts and compose typed queries in code that can execute across different databases.
- This separates concerns and improves developer efficiency over managing schemas and databases separately.
- Examples demonstrate MetaQL queries in graph, path, select, and aggregation formats across SQL, NoSQL, and RDF implementations.
1. The document summarizes steps towards integrating the H2O and Spark frameworks, including allowing data sharing between Spark and H2O.
2. A demonstration is shown of loading airline data from a CSV into a Spark SQL table, querying the table, and transferring the results to an H2O frame to run a GBM algorithm.
3. Next steps discussed include optimizing data transfers between Spark and H2O, developing an H2O backend for MLlib, and addressing open challenges in areas like transferring results and supporting Parquet.
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the... – Amazon Web Services
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology, and Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. The combination of the two can provide a solution to power advanced analytics, not only for what has happened in the past but also to make intelligent predictions about the future. Please join this webinar to learn how to get the most value from your data for your data-driven business.
Learning Objectives:
How to scale your Redshift queries with user-defined functions (UDFs)
How to apply Machine learning to historical data in Amazon Redshift
How to visualize your data with Amazon QuickSight
Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs or predictive analytics to their applications, database administrators who need to meet the demands of data-driven organizations, and decision makers looking to derive more insight from their data
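As a rough sketch of the first learning objective: Redshift scalar UDFs can be written in Python (`LANGUAGE plpythonu`), so the function body can be unit-tested locally before being wrapped in `CREATE FUNCTION`. The UDF name and model coefficients below are invented for illustration, in the spirit of applying a pre-trained model to historical rows:

```python
import math

# Body of a hypothetical Redshift scalar Python UDF that applies a
# pre-trained logistic-regression model to each row. In Redshift it
# would be wrapped roughly as:
#   CREATE OR REPLACE FUNCTION f_churn_score (float, float)
#   RETURNS float STABLE AS $$ ...body below... $$ LANGUAGE plpythonu;
# The coefficients are made up for illustration.
def f_churn_score(monthly_spend, support_tickets):
    z = -2.0 + 0.01 * monthly_spend + 0.8 * support_tickets
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability in (0, 1)

score = f_churn_score(50.0, 3)
```

Once registered, the function is called like any SQL expression, e.g. `SELECT customer_id, f_churn_score(monthly_spend, support_tickets) FROM customers;`, and Redshift parallelizes it across slices.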
Use dependency injection to get Hadoop *out* of your application code – DataWorks Summit
Hadoop MapReduce provides transparent parallelization but often results in specialized code bases that interact with low-level data formats. We present a means of using dependency injection to manage data flows in MapReduce which in turn supports reusable, Hadoop-agnostic application code that interacts with high-level business domain objects. An example is provided that applies Dependency Injection to the Hadoop WordCount example and shows how the same code invoked from the WordCount MapReduce job can be reused in a real-time context. We then discuss Opower’s application of this pattern to employ the same core calculations in both batch processing and in servicing real-time requests from end users. This topic will be of interest to those interested in reusing core batch calculations in real-time contexts. It also provides a means forward for organizations moving to Hadoop that have existing code components that they would like to employ in batch MapReduce computations.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... – Spark Summit
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick up today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all product catalog changes in the index had a 24-hour delay; using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end-customers instant access to features like the availability of a product, store pick up, etc.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also in many domains such as performance monitoring, fraud detection, etc. Along the way, we realized that Spark DataFrames not only process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc., all in the same place very easily, and can build DataFrame templates which can be used and reused by multiple teams effectively. We believe that, if implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential of becoming a unified data language.
We conclude that Spark Streaming and DataFrames are key to processing extremely large streams of data in real-time with ease of use.
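At its core, the anomaly-detection idea reduces to flagging points that sit far from the norm. A minimal pure-Python z-score sketch of that logic (the Walmart framework itself is built on Spark DataFrames, which this illustration does not use; the threshold and sample series are made up):

```python
import statistics

def zscore_anomalies(values, threshold=2.5):
    """Return indices of points whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Search metric with one obvious spike at index 5.
series = [100, 102, 99, 101, 100, 500, 98, 103, 100, 101]
print(zscore_anomalies(series))  # [5]
```

In a DataFrame setting the same computation would be expressed with window functions over the metric column, letting Spark parallelize the mean/stdev aggregation across partitions.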
Similar to Pattern: An Open Source Project for Migrating Predictive Models from SAS (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis – DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi – DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... – DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make the most of it, nor is it the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use today, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... – DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
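The heart of the dirlist design is that row keys encode depth plus path, so all direct children of a directory sort contiguously and can be fetched with a single range scan. A minimal in-memory sketch of that table design, with a sorted Python list standing in for the Accumulo table (the padding width and sample paths are arbitrary choices for illustration):

```python
import bisect

# Row keys encode zero-padded depth + full path, so children of a
# directory are one contiguous key range in the sorted table.
def row_key(path):
    depth = 0 if path == "/" else path.count("/")
    return f"{depth:03d}{path}"

paths = ["/", "/home", "/home/alice", "/home/alice/notes.txt",
         "/home/bob", "/var", "/var/log"]
table = sorted(row_key(p) for p in paths)  # stand-in for the Accumulo table

def list_children(directory):
    """Range-scan for the direct children of `directory`."""
    depth = (0 if directory == "/" else directory.count("/")) + 1
    prefix = f"{depth:03d}{directory.rstrip('/')}/"
    lo = bisect.bisect_left(table, prefix)
    hi = bisect.bisect_left(table, prefix + "\x7f")  # end of the prefix range
    return [key[3:] for key in table[lo:hi]]

print(list_children("/home"))  # ['/home/alice', '/home/bob']
```

Because the scan touches only the keys in one range, listing a directory stays cheap no matter how deep or wide the rest of the tree grows, which is exactly the property the Accumulo example exploits.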
HBase Global Indexing to support large-scale data ingestion at Uber – DataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix – DataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi – DataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability Improvements – DataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine – DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... – DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud – DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi – DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger – DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... – DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark – DataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved near linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf – Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
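Stripped of the index structures, vector search is ranking documents by embedding similarity. A toy brute-force sketch of that core idea, with made-up 3-d embeddings (a real deployment would use Atlas's approximate-nearest-neighbor index and embeddings produced by a model):

```python
import math

# Toy corpus: document -> hypothetical embedding vector.
docs = {
    "red running shoes": [0.9, 0.1, 0.0],
    "blue hiking boots": [0.7, 0.3, 0.1],
    "espresso machine":  [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, k=2):
    """Rank all documents by similarity to the query vector, return top k."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(search([1.0, 0.0, 0.0]))  # footwear docs rank above the espresso machine
```

The brute-force scan here is O(corpus size) per query; the point of an engine like Atlas Vector Search is to return approximately the same ranking from an index without touching every vector.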
UiPath Test Automation using UiPath Test Suite series, part 5 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Communications Mining Series - Zero to Hero - Session 1 – DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
2. Copyright 2014, Concurrent Inc. – Confidential
PATTERN IN A NUTSHELL

Pattern is:
• An open source project that works on top of Cascading to support scoring of PMML models (from R, SAS, etc.) at scale on Hadoop.
• Models are reused and deployed within Cascading workflows.
4. Experiments – comparing models
• Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
• Run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models

The following example compares two models trained with different machine-learning algorithms. The effect is exaggerated: one model has an important variable intentionally omitted, to help illustrate the experiment.
5. ## load the "baseline" reference data
library(randomForest)  # needed for randomForest()
library(pmml)          # needed for pmml()
library(XML)           # needed for saveXML()

dat_folder <- '.'
data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="",
  na.strings="NULL", header=TRUE, encoding="UTF-8")

## split data into test and train sets
set.seed(71)
split_ratio <- 2/10
split <- round(dim(data)[1] * split_ratio)
data_tests <- data[1:split,]

## drop the "order_id" column from the training set
data_train <- data[(split + 1):dim(data)[1],]
i <- colnames(data_train) == "order_id"
j <- 1:length(i)
data_train <- data_train[,-j[i]]

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=25)

## inspect variable importance and the OOB error estimate
print(fit$importance)
print(fit)

## export the RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))
Experiments – Random Forest model

OOB estimate of error rate: 13.12%
Confusion matrix:
   0  1 class.error
0 57  9   0.1363636
1 12 82   0.1276596
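As a quick sanity check (not part of the deck), the per-class errors and the OOB error rate can be recomputed from the confusion matrix above; this sketch assumes rows are actual labels and columns are predicted labels, as randomForest prints them.

```java
// Sanity-check that the reported per-class errors and OOB error rate
// follow from the confusion matrix printed by randomForest above.
public class ConfusionMatrixCheck {
    public static void main(String[] args) {
        // OOB confusion matrix from the slide: rows = actual, columns = predicted.
        int[][] m = { { 57,  9 },   // actual class 0
                      { 12, 82 } }; // actual class 1
        double err0 = (double) m[0][1] / (m[0][0] + m[0][1]); // class 0 error
        double err1 = (double) m[1][0] / (m[1][0] + m[1][1]); // class 1 error
        double oob  = (double) (m[0][1] + m[1][0])
                    / (m[0][0] + m[0][1] + m[1][0] + m[1][1]); // overall OOB error
        System.out.println("class 0 error = " + err0); // ≈ 0.1363636
        System.out.println("class 1 error = " + err1); // ≈ 0.1276596
        System.out.println("OOB error     = " + oob);  // 0.13125, consistent with the slide's 13.12%
    }
}
```

So the 13.12% figure is just (9 + 12) / 160 misclassified out-of-bag samples.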
7. Experiments – Random Forest model

In pattern/pattern-examples:

gradle clean jar

hadoop dfs -rmr out/classify

hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml

hadoop dfs -cat out/classify/part-*
8. CASCADING OVERVIEW
• Enterprise Grade – Proven application development framework for building robust and complex Big Data applications, with thousands of deployments.
• Productive – Cascading relies on software patterns to provide the optimal level of abstraction, greatly simplifying the creation, testing, deployment, and operation of applications by letting developers focus on business logic first.
• Flexible & Extensible – Runs on all popular Hadoop distributions, but is not limited to Hadoop. An easily extensible framework supporting a variety of extensions, tools, and other integrations.
[Diagram: data applications – ETL, analytics, data processing, machine learning – running on Hadoop, on-premise or in the cloud]
9. CASCADING ECOSYSTEM
[Diagram: the Cascading ecosystem – languages and extensions (SQL via Lingual, Clojure via Cascalog, plus Pattern and Scalding) running over Hadoop distributions, other data stores (RDBMS, MPP, EDW), and on-premise deployments]
10. BUSINESSES DEPEND ON US
• >30% of Marketplace’s 1000-node Hadoop cluster runs Cascading applications
• Cascading powers Revenues, Publisher Analytics, and User Engagement applications
• Built their business for weather insurance using Cascading; sold to Monsanto for $950MM
• Standardized on Cascading for their fraud detection business
11. CASCADING – DATA APPS

Enterprise IT
Extract Transform Load
Log File Analysis
Systems Integration
Operations Analysis

Corporate Apps
HR Analytics
Employee Behavioral Analysis
Customer Support | eCRM
Business Reporting

Telecom
Data processing of Open Data
Geospatial Indexing
Consumer Mobile Apps
Location-based services

Marketing / Retail
Mobile, Social, Search Analytics
Funnel analysis
Revenue attribution
Customer experiments
Ad Optimization
Retail recommenders

Consumer / Entertainment
Music Recommendation
Comparison Shopping
Restaurant Rankings
Real Estate
Rental Listings
Travel Search & Forecast

Finance
Fraud and Anomaly Detection
Fraud Experiments
Customer Analytics
Insurance Risk Metrics

Health / Biotech
Aggregate metrics for Govt
Person biometrics
Veterinary diagnostics
Next-Gen Genomics
Agronomics
Environmental Maps
12. CASCADING – MODEL METAPHOR
The Cascading processing model is based on a metaphor of flows based on patterns.
[Diagram: a Flow reads Tuples (with Fields) from a Source Tap, passes the Tuple Stream through a Pipe Assembly of Pipes, and writes results to a Sink Tap]
• Data is represented as flows of tuples.
• Pipes allow you to manage a data flow through functional programming: splitting, merging, filtering, parsing, transforming, grouping, aggregating, buffering, joining, etc.
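To make the metaphor concrete, here is a minimal sketch in plain Java streams – deliberately NOT the Cascading API, with made-up field names: a "source tap" of tuples flows through a small "pipe assembly" of filter, group, and aggregate steps into a "sink tap".

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// The tap/pipe metaphor sketched with java.util.stream (illustration only).
public class PipeMetaphor {
    public static void main(String[] args) {
        // "Source tap": a stream of tuples, each tuple a set of named fields.
        List<Map<String, Integer>> tuples = List.of(
            Map.of("user", 1, "amount", 10),
            Map.of("user", 2, "amount", 5),
            Map.of("user", 1, "amount", 7));

        // "Pipe assembly": Filtering, then Grouping and Aggregating.
        Map<Integer, Integer> totals = tuples.stream()
            .filter(t -> t.get("amount") > 5)
            .collect(Collectors.groupingBy(
                t -> t.get("user"),
                Collectors.summingInt(t -> t.get("amount"))));

        // "Sink tap": emit the result tuples.
        System.out.println(totals); // {1=17}
    }
}
```

In Cascading the same shape is expressed with Pipes and Taps, and the planner maps it onto MapReduce jobs rather than an in-memory stream.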
13. CASCADE
• A Cascade joins together multiple Flows and executes them based on their dependencies.
14. FLOW EXECUTION
• Flow planners allow Flows to be independent of the execution platform; the planner is responsible for defining, sharing, and executing data-processing workflows.
• Currently there are two kinds of flow planners:
  - Local
  - Hadoop (1 & 2)
• Allows for “fail fast”:
  - The flow planners can check completeness of flows, operations, type safety, etc.
• Maps the pipe assembly to MapReduce in a deterministic way.
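The "fail fast" idea can be sketched as follows. This toy completeness check (all names hypothetical, not the Cascading implementation) verifies before any work is scheduled that every tap a flow references has actually been bound:

```java
import java.util.Set;
import java.util.TreeSet;

// Toy "fail fast" planner check: reject an incomplete flow before execution.
public class FailFastCheck {
    // Verify that every tap the flow references is bound to a concrete resource.
    static void plan(Set<String> requiredTaps, Set<String> boundTaps) {
        Set<String> missing = new TreeSet<>(requiredTaps);
        missing.removeAll(boundTaps);
        if (!missing.isEmpty())
            throw new IllegalStateException("unbound taps: " + missing);
        System.out.println("plan ok");
    }

    public static void main(String[] args) {
        // Complete flow: the planner accepts it.
        plan(Set.of("input", "classify"), Set.of("input", "classify"));
        // Incomplete flow: the planner rejects it before anything runs.
        try {
            plan(Set.of("input", "classify"), Set.of("input"));
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Cascading's real planners perform the same kind of up-front validation (completeness, operations, type safety), which is far cheaper than discovering a wiring mistake hours into a Hadoop job.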
20. PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
• Established XML standard for predictive model markup (it specifies the model, not an implementation of the model)
• Organized by the Data Mining Group (DMG) since 1997, http://dmg.org/
• Open standards for data mining and statistical models
• PMML producers: applications that create predictive models
• PMML consumers: applications that read or consume models

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
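Because PMML is plain XML, any consumer can inspect a model with a stock parser before scoring with it. A minimal sketch – the fragment below is trimmed for illustration and is NOT a complete, valid model:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Peek inside a (trimmed, illustrative) PMML document with a standard XML parser.
public class PmmlPeek {
    static final String PMML =
          "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
        + "<Header description=\"illustrative fragment\"/>"
        + "<TreeModel functionName=\"classification\"/>"
        + "</PMML>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(PMML.getBytes(StandardCharsets.UTF_8)));
        String root  = doc.getDocumentElement().getTagName();          // "PMML"
        String model = doc.getDocumentElement()
            .getElementsByTagName("TreeModel").item(0).getNodeName();  // "TreeModel"
        System.out.println(root + " declares a " + model);
    }
}
```

This producer/consumer decoupling is exactly what lets a model trained in R be scored by Pattern on Hadoop.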
21. PMML MODEL COVERAGE
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
23. BUILDING AND RUNNING PMML MODELS
[Diagram: Training – ETL prepares data; a model producer explores the data and builds a model using regression, clustering, etc., exporting a PMML model. Scoring – ETL prepares new data; a model consumer (Pattern) applies the PMML model to produce scores; post-processing measures and improves the model.]
24. PATTERN: CREATE A MODEL IN R

## load the libraries we need (randomForest for the model, pmml/XML for export)
library(randomForest)
library(pmml)
library(XML)

## (assumes dat_folder, iris_train, and iris_full have been defined beforehand)

## train a RandomForest model
f <- as.formula("as.factor(species) ~ .")
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

## test the model on the holdout test set
print(fit)

out <- iris_full
out$predict <- predict(fit, out, type="class")

## export predicted labels to TSV
write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export the RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
26. PATTERN: REUSE A MODEL

public static void main(String[] args) throws RuntimeException {
  String inputPath = args[0];
  String classifyPath = args[1];

  Properties properties = new Properties();
  AppProps.setApplicationJarClass(properties, Main.class);
  HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

  Tap inputTap = new Hfs(new TextDelimited(true, "\t"), inputPath);
  Tap classifyTap = new Hfs(new TextDelimited(true, "\t"), classifyPath);

  OptionParser optParser = new OptionParser();
  optParser.accepts("pmml").withRequiredArg();
  OptionSet options = optParser.parse(args);

  FlowDef flowDef = FlowDef.flowDef()
    .setName("classify")
    .addSource("input", inputTap)
    .addSink("classify", classifyTap);

  if (options.hasArgument("pmml")) {
    String pmmlPath = (String) options.valuesOf("pmml").get(0);
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput(new File(pmmlPath))
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField(new Fields("predict", Double.class)); // default value if missing from the model
    flowDef.addAssemblyPlanner(pmmlPlanner);
  }

  Flow classifyFlow = flowConnector.connect(flowDef);
  classifyFlow.complete();
}
27. PATTERN: SCORE A MODEL

## run an RF classifier at scale

hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix

hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --measure out/measure

## run a predictive model at scale, measure RMSE

hadoop jar ./build/libs/patterndemo.jar data/iris.rf.tsv output/classify --pmml data/iris.rf.xml --rmse out/measure
29. Experiments – comparing models
• Much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
• Run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models

The following example compares two models trained with different machine-learning algorithms. The effect is exaggerated: one model has an important variable intentionally omitted, to help illustrate the experiment.
30. ## load the "baseline" reference data
library(randomForest)  # needed for randomForest()
library(pmml)          # needed for pmml()
library(XML)           # needed for saveXML()

dat_folder <- '.'
data <- read.table(file=paste(dat_folder, "data/orders.tsv", sep="/"), sep="\t", quote="",
  na.strings="NULL", header=TRUE, encoding="UTF-8")

## split data into test and train sets
set.seed(71)
split_ratio <- 2/10
split <- round(dim(data)[1] * split_ratio)
data_tests <- data[1:split,]

## drop the "order_id" column from the training set
data_train <- data[(split + 1):dim(data)[1],]
i <- colnames(data_train) == "order_id"
j <- 1:length(i)
data_train <- data_train[,-j[i]]

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=25)

## inspect variable importance and the OOB error estimate
print(fit$importance)
print(fit)

## export the RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "data/antifraud.rf.xml", sep="/"))
Experiments – Random Forest model

OOB estimate of error rate: 13.12%
Confusion matrix:
   0  1 class.error
0 57  9   0.1363636
1 12 82   0.1276596
32. Experiments – Random Forest model

In pattern/pattern-examples:

gradle clean jar

hadoop dfs -rmr out/classify

hadoop jar build/libs/pattern-examples-*.jar data/sample.tsv out/classify --pmml data/antifraud.rf.xml

hadoop dfs -cat out/classify/part-*
33. PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest

Also, model chaining and general support for ensembles. Algorithms can be added or extended based on customer use cases.
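One common way ensembles combine member models is a majority vote over their predictions. A minimal sketch in plain Java – the three toy "models" are hypothetical stand-ins, not Pattern code:

```java
import java.util.List;
import java.util.function.Function;

// Ensemble scoring sketched as a majority vote across several toy classifiers.
public class EnsembleSketch {
    public static void main(String[] args) {
        // Three toy "models", each mapping a feature vector to a 0/1 label.
        List<Function<double[], Integer>> models = List.of(
            x -> x[0] > 0.5 ? 1 : 0,
            x -> x[1] > 0.5 ? 1 : 0,
            x -> (x[0] + x[1]) > 1.0 ? 1 : 0);

        double[] sample = { 0.9, 0.2 };
        // Count the models voting for class 1; a strict majority wins.
        long votes = models.stream().filter(m -> m.apply(sample) == 1).count();
        int label = (votes * 2 > models.size()) ? 1 : 0;
        System.out.println("votes=" + votes + " label=" + label); // votes=2 label=1
    }
}
```

A random forest is itself this idea applied to many decision trees; model chaining instead feeds one model's output in as another's input.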
34. BUILDING AND RUNNING PMML MODELS
[Diagram: Training – Lingual handles ETL and data preparation; a model producer explores the data and builds a model using regression, clustering, etc., exporting a PMML model. Scoring – Lingual prepares new data; a model consumer (Pattern) applies the PMML model to produce scores; post-processing measures and improves the model.]
35. PATTERN: SINGLE MODEL ARCHITECTURE
Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool, decreasing project costs and reducing licensing costs.
[Diagram: Lingual (ANSI SQL) performs ETL and data preparation; Pattern (PMML) runs the predictive model; both execute on Cascading]
36. PATTERN BENEFITS
• Can score data and run experiments at scale on Hadoop
• Run different models using ensembles
• In turn, this allows you to improve existing models and their accuracy
37. PATTERN: ARCHITECTURE
Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
[Diagram: Lingual (ANSI SQL) performs ETL and data preparation; Pattern (PMML) runs the predictive model]

FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "data.source1", emplTap )
  .addSource( "data.source2", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
38. PATTERN: ARCHITECTURE
Cascading allows multiple departments to combine their workflow components into a single integrated app (jar) – based on 100% open source – that can be managed by a single tool.
[Diagram: Lingual (ANSI SQL) performs ETL and data preparation; Pattern (PMML) runs the predictive model]

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
40. PATTERN: DEMO
1. Generate the model in R
2. Examine the PMML model
3. Write & run a Cascading app to score data with the model
41. KEY TAKEAWAYS
• Reuse existing learning models and investments to run data scoring at scale
• Leverage existing skill sets: Java, Scala, SQL, PMML, etc.
• Allow teams to collaborate on a single model that can be visualized, managed, and monitored