This document describes an autonomous analytics platform that allows users to analyze streaming data. The platform uses a unified big data technology stack including Spark, Cassandra, Hadoop, Kafka and Elasticsearch. It has a cloud-agnostic architecture and supports multiple machine learning frameworks. The platform includes a Domain Specific Language (DSL) that allows power users to create full data pipelines and analytics workflows with a few lines of code. It also includes a DSL Workbench for interactively building, editing and publishing analytical pipelines. Additionally, the document introduces "Auto Curious", which harnesses user interactions to autonomously discover insights and compose DSL commands through a question graph interface.
Building an intelligent big data application in 30 minutes, by Claudiu Barbura
Strata Barcelona presentation slides featuring a live demo of building an intelligent big data application from a web console. The tools and APIs behind it are built on top of Spark, Spark SQL/Shark, Tachyon, Mesos, Cassandra, SolrCloud, and IPython, and include: an ELT pipeline (ingestion and transformation), a data warehouse explorer, export to NoSQL with generated APIs, export to SolrCloud with generated APIs, predictive model building, training and publishing, a dashboard UI, and monitoring and instrumentation.
StreamAnalytix 2.0 is a real-time data analytics platform that offers multi-engine support, allowing users to choose the best streaming engine for their use case. It provides an easy drag-and-drop UI and supports technologies like Spark Streaming, Kafka, and Storm. StreamAnalytix enables enterprises to analyze and respond to events in real time at big data scale.
The document discusses modern data architectures. It presents conceptual models for data ingestion, storage, processing, and insights/actions. It compares traditional vs modern architectures. The modern architecture uses a data lake for storage and allows for on-demand analysis. It provides an example of how this could be implemented on Microsoft Azure using services like Azure Data Lake Storage, Azure Databricks, and Azure SQL Data Warehouse. It also outlines common data management functions such as data governance, architecture, development, operations, and security.
StreamAnalytix 2.0 is a multi-engine streaming analytics platform that allows users to deploy multiple streaming engines depending on their use case requirements. It features an easy to use drag-and-drop UI, support for predictive analytics, machine learning, and real-time dashboards. The platform provides a level of abstraction that gives customers flexibility in choosing the best streaming engine for their needs.
Building scalable software requires designing it so that adding more hardware allows the software to utilize that hardware. Key considerations include avoiding contention over shared resources like CPU, disk, memory and network. Examples of scalable architectures include lock-free skiplist indexes, sharding or partitioning data across multiple machines, distributed query execution, and columnar data stores. Building for scale changes how software features are developed, requiring simple initial designs, leveraging existing resources, ensuring the right technical decisions through code reviews and technical leadership.
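The sharding idea mentioned above can be sketched in a few lines of Python. The hash function and shard count here are illustrative choices, not taken from the talk; the point is that routing each key by a hash spreads load across machines without any shared, contended resource:

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    """Route a record to a shard by hashing its key.

    Hashing spreads keys roughly evenly across shards, so adding
    machines (shards) lets the system absorb more load -- the
    contention-avoidance idea from the summary above.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Distribute some example user IDs across 4 shards.
shards = {i: [] for i in range(4)}
for user_id in ["alice", "bob", "carol", "dave", "erin", "frank"]:
    shards[shard_for_key(user_id, 4)].append(user_id)

print(shards)
```

Note that the routing is deterministic: the same key always lands on the same shard, which is what makes lookups and distributed query execution possible without a central coordinator.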
Big Data on EC2: Mashing Technology in the Cloud, by George Ang
This document discusses how a startup serving widgets on popular online publications scaled their infrastructure using Amazon Web Services to handle spikes in traffic from over 1 billion users sharing over 10 billion URLs. They used a hub-and-spoke architecture with components like Cascading, Amazon Elastic MapReduce, and AsterData to analyze user sharing patterns in a cost-effective and horizontally scalable way.
This document provides an overview of Azure Machine Learning including an introduction to the service, differences between Azure ML and SSAS Data Mining, demos of building and consuming ML models, and a quick introduction to other relevant Azure tools like Azure Stream Analytics, Azure Data Factory, and Azure Intelligent Systems Service. The presenter has experience with SQL Server BI, .NET, and is a BI developer but not a data scientist.
Rick Negrin discusses enabling real-time analytics for IoT applications. He describes how industries are increasingly needing real-time analytics due to trends like the on-demand economy and rise of IoT. He then outlines an architecture using Kafka for messaging and MemSQL for real-time analytics. MemSQL is presented as a SQL database that can ingest millions of events per second while analyzing petabytes of data. Finally, Negrin demonstrates an IoT application called MemEx that combines MemSQL, Kafka and Spark to enable predictive analytics on sensor data for supply chain management.
A comparison of Excel add-ins and other solutions for implementing data mining or machine learning on the Microsoft stack, including coverage of XLMiner, Analysis Services Data Mining, and Predixion Software.
The document discusses building an end-to-end analytic solution in the cloud using Microsoft Azure tools, including ingesting data from various sources into Azure Data Factory, storing it in Azure Data Lake, transforming the data using U-SQL scripts in Azure Data Lake Analytics, developing predictive models with Azure Machine Learning Studio, and visualizing insights with Power BI. It provides examples of how each tool in the analytic lifecycle can be leveraged as part of an overall cloud-based analytics solution handling large volumes of data.
The Road to Cloud Native Apps - Azure AI, by Plain Concepts
This document discusses different Azure AI services:
- Cognitive Services which provide pre-built machine learning algorithms to solve AI problems with little development needed. It highlights Computer Vision, Text Analytics, and other services.
- Azure Databricks which is an Apache Spark-based analytics platform optimized for Azure and designed for collaboration between data teams. It emphasizes easy infrastructure for big data and full Azure connectivity.
- Azure ML Workspace which is a tool to ease the entire machine learning process with experiment tracking, model versioning, predictive image creation and deployment.
Batchly is a cloud-based batch job processing service that abstracts away AWS cloud complexities and enables cost-effective processing of jobs. It uses machine learning to optimize infrastructure usage for cost and time. It provides a modern web portal with automatic scaling, supports Windows and Linux, and has features like live monitoring, progress tracking, and reporting. Microsoft Azure Batch enables running large parallel and HPC workloads in Azure at scale but requires direct management of Azure services and only supports Windows currently.
_Search? Made Simple: Elastic + App Search, by Elasticsearch
Get an in-depth look at Elastic App Search, the fastest and simplest way to add search to your internal or external application. Learn how to quickly deploy highly relevant and performant search in your app.
Data cleansing and data prep with Synapse data flows, by Mark Kromer
This document contains links to resources about using Azure Synapse Analytics for data cleansing and preparation with Data Flows. It includes links to videos and documentation about removing null values, saving data profiler summary statistics, and using metadata functions in Azure Data Factory data flows.
This document is Vaibhav Sachdeva's resume. It outlines his education, internships, projects, technical trainings, extracurricular activities, skills, languages, and interests. He holds a B.Tech in computer science from Manav Rachna University, with a score of 77%. His internships include working on the FARMAP project using SQL Server Reporting Services and developing chat and photo gallery applications using Node.js, Express, and AWS services. He also has technical skills in AWS, HTML, CSS, SQL, JavaScript, and more.
Big Data Advanced Analytics on Microsoft Azure 201904, by Mark Tabladillo
This talk summarizes key points for big data advanced analytics on Microsoft Azure. First, there is a review of the major technologies. Second, there is a series of technology demos (focusing on VMs, Databricks and Azure ML Service). Third, there is some advice on using the Team Data Science Process to help plan projects. The deck has web resources recommended. This presentation was delivered at the Global Azure Bootcamp 2019, Atlanta GA location (Alpharetta Avalon).
During this presentation, after walking through a few ways to use MLflow on Azure directly, we'll cover how upcoming solutions from our group leverage MLflow for core functionality. BenchML is a new repository that aims to provide consumers of prebuilt ML endpoints visibility into the performance of each public offering for a given dataset as well as comparing results across multiple offerings. Using MLflow, BenchML is able to remain cloud-agnostic and offer a delightful local experience while leveraging the aforementioned integration to provide Azure users with a fully managed experience.
Speaker Bio: Akshaya is an engineer on the AI Platform at Microsoft, having shipped both GA versions of Azure Machine Learning over the years as well as the OSS repo MMLSpark. As the recent version of Azure ML pivoted to become more of an open platform rather than a managed product, his focus has shifted outward, toward open-source platform definitions for cloud-scale implementations, with emphasis on MLflow as the Azure ML managed tracking store.
This talk was presented at the Bay Area MLflow Meetup at Databricks HQ in San Francisco: https://www.meetup.com/Bay-Area-MLflow/events/266614106/
Qubole is a cloud-native data platform that includes a native connector for Tableau to enable business intelligence and visual analytics on any cloud data lake with any file format. The Qubole connector delivers fast query response times for Tableau users through Presto on Qubole, while automatically managing cloud infrastructure based on user demand to prevent performance impacts or resource competition for simultaneous users. Tableau customers have flexibility to query unstructured or semi-structured data on any data lake, leveraging Presto's high performance without changing their normal workflow.
Kinesis is an AWS streaming data service that allows users to ingest, process, and analyze real-time streaming data. It offers three main services: Kinesis Data Streams to collect and process streaming data in real-time; Kinesis Data Firehose to load streaming data into AWS data stores; and Kinesis Data Analytics to process and analyze streaming data using SQL or Java for real-time analytics and alerts. Kinesis provides scalability, is fully managed, and enables users to build real-time applications on streaming data.
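Kinesis routes each record to a shard by hashing its partition key into a 128-bit space, with each shard owning a contiguous hash range. The following is a simplified pure-Python model of that routing; the four-shard stream and the sensor keys are invented for the example:

```python
import hashlib

NUM_SHARDS = 4
SPACE = 2 ** 128  # Kinesis hashes partition keys into a 128-bit space

def shard_for_partition_key(key: str) -> int:
    """Pick the shard whose hash range contains MD5(key).

    This mirrors, in simplified form, how Kinesis Data Streams routes
    records: the partition key is MD5-hashed and the record goes to the
    shard that owns that point of the 128-bit hash space.
    """
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return h // (SPACE // NUM_SHARDS)

records = [("sensor-1", "22.5C"), ("sensor-2", "21.9C"), ("sensor-1", "22.7C")]
for key, payload in records:
    print(f"{key!r} -> shard {shard_for_partition_key(key)}")
```

A consequence worth noticing: all records sharing a partition key land on the same shard, which is what preserves per-key ordering in the stream.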
Finding new Customers using D&B and Excel Power Query, by Lynn Langit
Screencast which shows how to use Excel Power Query with D&B APIs to get company DUNS numbers and other company information from the Windows Azure Marketplace.
This document describes the key features of Azure ML Experimentation which allows users to conduct machine learning experiments by running distributed TensorFlow or CNTK training jobs, perform hyperparameter searches, capture run metrics and models, and compare runs through leaderboards. It also enables using popular IDEs, editors, notebooks and frameworks while running experiments on the cloud.
This document discusses the evolution of machine learning tools and services in the cloud, specifically on Microsoft Azure. It provides examples of machine learning frameworks, runtimes, and packages available over time on Azure including Azure ML (2015) and the Microsoft Cognitive Toolkit (CNTK) (2015). It also mentions the availability of GPU resources on Azure starting in 2016 and limitations to consider for the Azure ML service including restrictions on programming languages and a lack of debugging capabilities.
The document describes the features and capabilities of Visual Studio Tools for AI, an AI developer tool for training models and integrating AI into applications. It can create deep learning projects with frameworks like TensorFlow and CNTK, debug and iterate quickly in Visual Studio. It is integrated with Azure Machine Learning for management of experiments and models, and can scale out training with Azure Batch AI. The tool allows monitoring of training, visualization with TensorBoard, and generation of code from trained models.
Companion to the presentation "From Evergreen to Edible" (Rooting DC 2017): resources for making an eighth of an acre, house included, edible instead of pre-planted evergreens.
Teacher Fernanda López ran a newspaper-cartoon activity with her second-grade students. The students enjoyed the topic and showed their creativity in drawing the cartoons, most of them of politicians. It was a successful activity that let the students express their ideas through this artistic medium.
This document provides instructions for using different layout algorithms in Gephi to visualize networks. It discusses installing layout plugins, importing a graph file, running initial layouts like Force Atlas to view the network structure, adjusting layout properties, and using different layouts to emphasize various network features. Various layout algorithms like Force Atlas, Fruchterman-Reingold, OpenOrd, and GeoLayout are introduced.
This document discusses Apache Cassandra, a distributed database management system. It provides an overview of Cassandra's features such as linear scalability, high performance and availability. The document also discusses how Cassandra addresses big data challenges through its integration of analytics and real-time capabilities. Several companies that use Cassandra share how it meets their needs for scalability, high performance and lower total cost of ownership compared to alternative solutions.
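Cassandra's linear scalability and availability rest on token-ring replica placement: a partition key is hashed onto a ring of nodes and written to the next few distinct nodes clockwise. Here is a toy sketch of that idea; the node names, the MD5-based partitioner, and the replication factor of 3 are illustrative simplifications, not Cassandra's actual implementation:

```python
import bisect
import hashlib

class TokenRing:
    """Tiny sketch of Cassandra-style replica placement.

    Each node owns a token on a ring; a partition key is hashed onto
    the ring and stored on the next `rf` nodes clockwise, so data stays
    available even when one replica is down.
    """

    def __init__(self, nodes, rf=3):
        self.rf = rf
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, partition_key: str):
        tokens = [t for t, _ in self.ring]
        # first node clockwise of the key's position, wrapping around
        start = bisect.bisect(tokens, self._token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]

ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], rf=3)
print(ring.replicas("user:42"))
```

Adding a node just claims another token on the ring, which is why capacity grows roughly linearly with cluster size.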
This document provides a tutorial on how to use various visualization settings in Gephi software. It discusses how to zoom and pan, select nodes, change node and edge colors, toggle 3D view, adjust label sizes and colors, and display node and edge attributes. The tutorial uses a sample airline routes dataset and provides tips on layout and text formatting to improve network visualization.
Migrating Netflix from Datacenter Oracle to Global Cassandra, by Adrian Cockcroft
Netflix is migrating its datacenter infrastructure from Oracle databases to a globally distributed Apache Cassandra database on AWS. This will allow Netflix to scale more easily and deploy new features faster without being limited by the capacity of its own datacenters. The migration involves transitionally replicating data between Oracle and AWS services like SimpleDB while new services are deployed directly on Cassandra. This will cut Netflix's dependence on its existing datacenters and allow it to fully leverage the elasticity of the public cloud.
The document describes the Social Informatics Data Grid (SIDGrid), which aims to:
1) Integrate heterogeneous datasets over time, place, and type through a shared data and service interface and common problems/theories.
2) Develop tools for collecting, storing, retrieving, annotating, and analyzing synchronized multi-modal data on computational grids.
3) The SIDGrid architecture allows streaming of video, audio and time series data across distributed datasets using time alignment, database, and grid computing standards. It provides search and analysis tools to browse over 4,000 projects containing various media files.
The document discusses the role and responsibilities of a data architect. It provides information on the high demand and salaries for data architects, which can be over $200,000 at companies like Microsoft. The summary also outlines some of the key technical skills required for the role, including strong data modeling abilities, knowledge of databases, ETL tools, analytics dashboards, and programming languages like SQL, Python and R. Business skills like communication and presenting complex concepts are also important.
Estimating the Total Costs of Your Cloud Analytics Platform, by DATAVERSITY
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
The document introduces a workshop on big data tools and MongoDB, discusses how MediaGlu uses big data for advertising by tracking user paths across different channels, and outlines an agenda covering MongoDB fundamentals, running MongoDB, and labs on shell commands, aggregation, replication, and sharding.
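The grouping step from the aggregation labs can be mimicked in plain Python. The collection name, fields, and values below are hypothetical, chosen only to show what a `$group`/`$sum` stage computes:

```python
from collections import defaultdict

# hypothetical ad-channel events, as documents
events = [
    {"channel": "email", "clicks": 3},
    {"channel": "social", "clicks": 5},
    {"channel": "email", "clicks": 2},
]

# roughly equivalent to:
# db.events.aggregate([{"$group": {"_id": "$channel",
#                                  "total": {"$sum": "$clicks"}}}])
totals = defaultdict(int)
for doc in events:
    totals[doc["channel"]] += doc["clicks"]

print(dict(totals))
```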
Researchers used deep learning techniques like ResNet and data augmentation to improve the accuracy of detecting snow leopards from 63.4% to 90%. They used transfer learning on a ResNet model to extract features from images, then trained a logistic regression classifier on those features to detect snow leopards. They also averaged predictions from multiple images and doubled their training data by flipping images horizontally. This helped improve the model's ability to identify snow leopards in photos.
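The flip-based augmentation trick is easy to show in miniature. The 2x2 "images" and labels below are made up for illustration; real inputs would be photo tensors, but the doubling logic is the same:

```python
def hflip(image):
    """Mirror an image left-to-right (rows are lists of pixel values)."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Double a training set by appending a horizontally flipped copy
    of every image -- a snow leopard is a snow leopard either way."""
    return dataset + [(hflip(img), label) for img, label in dataset]

train = [([[1, 2], [3, 4]], "leopard"), ([[5, 6], [7, 8]], "rock")]
augmented = augment(train)
print(len(train), "->", len(augmented))  # 2 -> 4
```

Because the label is invariant under the flip, this doubles the data at zero labeling cost, which is exactly why it helped the classifier described above.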
The document discusses using interactive event graphs and Spark to scale security investigations. It describes how Graphistry uses event graphs visualized through GPUs to provide scalable views of relationships and patterns across billions of events. An example is given of using this approach for incident response by constructing an event graph to analyze the spread of a botnet outbreak.
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ..., by Lucas Jellema
Overview of the highlights, main themes and grand announcements during Oracle OpenWorld 2016. Cloud, Big Data, Machine Learning, Infrastructure, raging against AWS and the Oracle future strategy are the chief topics.
The AMIS Team reviewed Oracle OpenWorld 2016 in Nieuwegein, Netherlands on October 13th. Some key themes discussed included Oracle focusing on growing its Infrastructure as a Service capabilities to better compete with Amazon Web Services, introducing new IaaS options that provide high performance networking and storage, and deploying IaaS both on-premises and in Oracle's public cloud. The document also covered Oracle expanding its Platform as a Service and Software as a Service offerings, including evolving traditional on-premises applications like E-Business Suite for deployment in Oracle's public cloud.
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing, by Matthew Vaughn
The document discusses tools for responsible and collaborative computing including clouds, clusters, containers, and Agave. It provides an overview of these tools and how they can help address challenges of big data including reproducibility, collaboration, and portability, while also noting potential issues like management complexity. Containers are presented as a way to compartmentalize code, eliminate complexity, and introduce reproducibility to scientific workflows.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
Science as a Service: How On-Demand Computing can Accelerate DiscoveryIan Foster
This document discusses the potential for "Science as a Service" by leveraging on-demand computing capabilities. It notes that most research labs have limited resources, so automation and outsourcing are needed to apply sophisticated methods to larger datasets. The author proposes building a "discovery cloud" by identifying time-consuming research activities that can be automated and delivered as software/platform/infrastructure as a service. This would help accelerate scientific discovery. Globus is highlighted as an example of a platform providing data management services using a software as a service model.
This document provides an introduction and overview of Spark:
- Spark is an open-source in-memory data processing engine that can handle large datasets across clusters of computers using an API in Scala, Python, or R.
- IBM is heavily committed to Spark, contributing the most code and fixing the most issues reported by other organizations to continually improve the full analytics stack.
- An example is presented on using Spark to predict hospital readmissions from diabetes patient data, obtaining AUC scores comparable to other published models.
This document discusses building a "discovery cloud" to accelerate scientific discovery through on-demand computing. It proposes identifying time-consuming research activities that can be automated and outsourced as software-as-a-service. This would achieve economies of scale through leveraging infrastructure-as-a-service. The goal is to create a great user experience for scientific data management similar to consumer services like Dropbox. It also discusses integrating services like Globus for data management and Galaxy for analysis to provide flexible and scalable genomics analysis.
Automated machine learning (automated ML) automates feature engineering, algorithm and hyperparameter selection to find the best model for your data. The mission: Enable automated building of machine learning with the goal of accelerating, democratizing and scaling AI. This presentation covers some recent announcements of technologies related to Automated ML, and especially for Azure. The demonstrations focus on Python with Azure ML Service and Azure Databricks.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
The document discusses Sparkle, a solution built by Comcast to address challenges in processing massive amounts of data and enabling data science workflows at scale. Sparkle is a centralized processing system with SQL and machine learning capabilities that is highly scalable and accessible via a REST API. It is used by Comcast to power various use cases including churn modeling, price elasticity analysis, and direct mail campaign optimization.
The Developer Data Scientist – Creating New Analytics Driven Applications usi...Microsoft Tech Community
The developer world is changing as we create and generate new data patterns and handling processes within our applications. Additionally, with the massive interest in machine learning and advanced analytics how can we as developers build intelligence directly into our applications that can integrate with the data and data paths we are creating? The answer is Azure Databricks and by attending this session you will be able to confidently develop smarter and more intelligent applications and solutions which can be continuously built upon and that can scale with the growing demands of a modern application estate.
What’s New in the Berkeley Data Analytics StackTuri, Inc.
The document discusses the Berkeley Data Analytics Stack (BDAS) developed by UC Berkeley's AMPLab. It summarizes the key components of the BDAS including Spark, Mesos, Tachyon, MLlib, and Velox. It describes how the BDAS provides a unified platform for batch, iterative, and streaming analytics using in-memory techniques. It also discusses recent developments like KeystoneML/ML Pipelines for scalable machine learning and SampleClean for human-in-the-loop analytics. The goal is to make it easier to build and deploy advanced analytics applications on large datasets.
NoSQL databases like MongoDB, Elasticsearch, and Cassandra are synonymous with scalability, search, and developer agility. But there's a downside: having to give up the ease and comfort of SQL.
Or do you?
Join this webcast to learn how the newest databases, like CrateDB and CockroachDB deliver the benefits of NoSQL with the ease of SQL by building SQL engines on top of custom NoSQL technology stacks. Database industry veteran Andy Ellicott, who helped launch Vertica, VoltDB, Cloudant, and now with Crate.io, will provide a no-BS view of current DBMS architectures and predictions for the future of data.
If you’re a DBMS user, this webcast will help you make sense of a very crowded DBMS market and make better-informed decisions for your new tech stacks.
Presentation at Data Days Texas 2015, in Austin. A deep dive into Spark, Tachyon and Mesos code, as well as Atigeo's open-source contributions: Jaws, a Spark SQL REST server, and a Spark job server.
This document outlines the agenda for a Tachyon Meetup in San Francisco. The agenda includes discussing the xPatterns architecture, BDAS++, demos of Tachyon internals and APIs, and lessons learned. BDAS++ refers to enhancements made to Tachyon to support Spark SQL and the Spark job server. Lessons learned focus on issues discovered like partial in-memory file storage bugs and best practices for Tachyon usage.
Building an intelligent big data application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra; open-sourcing Jaws, our own Spark SQL RESTful service; our contributions to the Spark and Mesos projects; and lessons learned.
This document outlines the agenda and content for a presentation on xPatterns, a tool that provides APIs and tools for ingesting, transforming, querying and exporting large datasets on Apache Spark, Shark, Tachyon and Mesos. The presentation demonstrates how xPatterns has evolved its infrastructure to leverage these big data technologies for improved performance, including distributed data ingestion, transformation APIs, an interactive Shark query server, and exporting data to NoSQL databases. It also provides examples of how xPatterns has been used to build applications on large healthcare datasets.
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
The document discusses lessons learned from embedding Cassandra in the xPatterns big data analytics platform. It provides an agenda that includes discussing Cassandra usage in xPatterns, the necessary developments like data modeling optimizations, robust REST APIs, geo-replication, and a demo of exporting to NoSQL APIs. Key lessons learned since Cassandra versions 0.6 to 2.0.6 are also summarized, such as the need for consistent clocks, reducing column families, and monitoring.
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation; tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs; as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse of several hundred TB of medical, pharmacy and lab data, comprising tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack of Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
2. • Architect and Dev Mgr at ubix.ai … data science platform
• Infrastructure & real-time services -> Data Science at scale
• xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra)
• SeaScale: my first ever meetup!
• Strata, Spark, C* summits & local meetups
About Me
3. • Ubix Data Eng & Science Platform Architecture
• High dimensional sparse feature spaces
• OKA (OverKill Analytics) and Composite Modelling
• (Kaggle) Outbrain Click Prediction: demo in DSL Workbench
• pymap deep dive: distributed scikit-learn through Spark
• python injection into DSL: pySpark/Scala JVM interop
• Q&A
Agenda
4. Data Eng & Science Platform: “Engine”
Unified big data technology stack (Spark, Cassandra, Hadoop, Kafka, Elasticsearch …)
Cloud agnostic architecture
Universal predictive interface (MLlib, ML Pipeline, VW, scikit-learn, R, H2O … TF)
Extensible, with integration via a fluent and expressive API (DSL)
Enterprise grade: scalability, performance, high availability, geo-replication, resilience, security, manageability, interoperability, testability
7. • high dimensional feature engineering demands a sparse representation
• spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand-sparse, filter-sparse, load-sparse (libsvm format)
• sparse vs dense: native input to mllib, spark.ml, scikit-learn algos
• exceptions: spark 1.6 mllib’s kmeans, gmm, RF (breeze linear algebra or … slow)
• feature (2-way) encoding + vocabulary extraction (error analysis, importance)
• Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text)
High dimensional sparse feature spaces
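The sparse-vs-dense bullets above can be made concrete with a minimal scipy sketch (illustrative only, not the ubix DSL's compress-sparse/merge-sparse commands): a CSR matrix stores only the non-zeros of a high dimensional feature space, and feature hashing bounds dimensionality for text without a vocabulary. The `hash_features` helper and the dimensions are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

# One-hot-style encoding over a large vocabulary: each row has only a
# handful of non-zeros, so CSR storage stays tiny while the nominal
# dimensionality is one million.
n_features = 1_000_000
rows = [0, 0, 1, 1, 2]
cols = [3, 517_002, 3, 42, 999_999]
vals = [1.0, 1.0, 1.0, 1.0, 1.0]
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, n_features))

print(X.nnz)    # 5 stored values; the dense equivalent would need 3e6 floats
print(X.shape)  # (3, 1000000)

# Feature hashing (the "Hashing (text)" bullet): map tokens to column
# indices with a hash, bounding dimensionality without any vocabulary
# extraction. Collisions are the accepted trade-off.
def hash_features(tokens, dim=2**18):
    v = np.zeros(dim)
    for t in tokens:
        v[hash(t) % dim] += 1.0
    return sparse.csr_matrix(v)

h = hash_features(["user:123", "site:cnn", "topic:sports"])
print(h.shape)  # (1, 262144)
```

CSR matrices are accepted natively by MLlib, spark.ml and scikit-learn estimators, which is why the slide flags the few algorithms (e.g. Spark 1.6 KMeans/GMM/RF) that densify internally as exceptions.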
8. • OKA: “design philosophy for predictive models favors volume over precision, utility over elegance, and CPU over IQ … a brute force attack on data science, compromising fine-tuning”
• Alternative to Dimensionality reduction - train on full sparse feature space!
• Composite Modeling = managing part models as one ensemble
• distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting
• unsupervised learning output -> input to supervised learning (clustering + ensembling)
• dimensionality reduction or building semantically different models within clusters
• OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models)
OKA (OverKill Analytics) & Composite Modelling
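A minimal sketch of the composite-modelling idea above: several part models are trained on partitions of the data and combined by averaging their predicted probabilities (the "prediction table output for averaging, voting" bullet). This hand-rolls the pattern with scikit-learn on synthetic data; the platform's actual ensemble management, data and model names are not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable data standing in for a real feature space.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Train one part model per random row partition ("volume over precision":
# many cheap models instead of one finely tuned one).
idx = rng.permutation(len(X))
partitions = np.array_split(idx, 3)
parts = [LogisticRegression().fit(X[p], y[p]) for p in partitions]

def ensemble_predict(X_new):
    # Manage the part models as one ensemble: average their probabilities
    # (a majority vote over hard predictions would work the same way).
    probs = np.mean([m.predict_proba(X_new)[:, 1] for m in parts], axis=0)
    return (probs > 0.5).astype(int)

acc = (ensemble_predict(X) == y).mean()
print(round(acc, 3))
```

The same averaging step is what lets distributed scikit-learn/TF/VW part models, or per-cluster models from an unsupervised pre-pass, be combined without retraining anything jointly.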
9. • Outbrain: content discovery platform … 250 billion personalized recommendations/month
• Kaggle: predict which recommended content each user will click?
• sample of users’ page views and clicks (14 days) .. sets of content recommendations served to a specific user in a specific context, plus
• document metadata: mentioned entities (person, organization, location), a taxonomy of categories, the topics mentioned, and the publisher.
• 2 billion page views and 16.9 million clicks from 700 million unique users across 560 sites
Outbrain Click Prediction
10. • primitives for model management (model + metadata)
• optimizations for clustering + composite modeling techniques
• compute partition size/count to avoid OOM (simple with static allocation of resources: Mesos coarse-grained or YARN)
• wrapped pySpark (jvmContext) through a gateway server context (JavaGateway)
• python-scala interop through cached temp tables (registerTempTable)
pymap - distributed python
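The pymap pattern in the bullets above can be sketched locally: ship a scikit-learn training closure to each data partition and collect the fitted part models, the way `rdd.mapPartitions` would on Spark. The SparkContext, py4j JavaGateway plumbing and temp-table interop are deliberately elided here; `partitions` and `train_partition` are illustrative stand-ins, not pymap's real API.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data sliced into three "RDD partitions".
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
partitions = [(X[i::3], y[i::3]) for i in range(3)]

def train_partition(part):
    # In real pymap this closure runs on an executor; the model is pickled
    # so it can travel back to the driver as bytes.
    Xp, yp = part
    model = LogisticRegression().fit(Xp, yp)
    return pickle.dumps(model)

# Driver side: "mapPartitions" then "collect" (here just a local loop).
models = [pickle.loads(train_partition(p)) for p in partitions]

# The part models' predictions feed one prediction table, combined by
# averaging -- the same composite-modelling step as on the OKA slide.
avg = np.mean([m.predict(X) for m in models], axis=0)
pred = (avg > 0.5).astype(int)
print(len(models), round((pred == y).mean(), 3))
```

Per-partition training is why the slide stresses computing partition size/count up front: each partition must fit the executor's memory, and with static allocation (Mesos coarse-grained or YARN) that budget is known in advance.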