Big SQL is a SQL engine for Hadoop that excels at performance and scalability at high concurrency. Big SQL complements and integrates with Apache Hive for both data and metadata. An architecture that separates compute from storage allows Big SQL to support multiple open data formats natively. Until recently, Parquet provided a significant performance advantage over other data formats for SQL on Hadoop. The landscape changed when ORC became a top level Apache project independent from Hive. Gone were the days of reading ORC files using slow, single-row-at-a-time Hive Serdes. The new vectorized APIs in the Apache ORC libraries make it possible to ingest ORC data at blazing speed. This talk is about the journey leading to ORC taking the crown of best performing data format for Big SQL away from Parquet. We'll have a look under the hood at the architecture of Big SQL ORC readers, and how to tune them. We'll share lessons learned in walking the fine line between maximizing performance at scale and avoiding dreaded Java OOMs . You'll learn the techniques that SQL engines use for fast data ingestion, so that you can leverage the full potential of Apache ORC in any application.
Speaker:
Gustavo Arocena, Big Data Architect, IBM
In this webinar, Postman Developer Educator Sue Smith walks you through the basics of the Postman API Platform and what you can do with it. Topics include:
- How to get started with Postman
- Key tips and tricks every student should know, including how to build basic request configurations and which Postman features are most likely to be useful to you
- Best practices for using Postman to further your education
AI and ML Skills for the Testing World TutorialTariq King
Software continues to revolutionize the world, impacting nearly every aspect of our work, family, and personal life. Artificial intelligence (AI) and machine learning (ML) are playing key roles in this revolution through improvements in search results, recommendations, forecasts, and other predictions. AI and ML technologies are being used in platforms for digital assistants, home entertainment, medical diagnosis, customer support, and autonomous vehicles. Testing practitioners are recognizing the potential for advances in AI and ML to be leveraged for automated testing—an area that still requires significant manual effort. Tariq King and Jason Arbon introduce you to the world of AI for software testing. Learn the fundamentals behind autonomous and intelligent agents, ML approaches including Bayesian networks, decision tree learning, neural networks, and reinforcement learning. Discover how to apply these techniques to common testing tasks such as identifying testable features, generating test flows, and detecting erroneous states.
SonarQube - Should I Stay or Should I Go ? Geeks Anonymes
...by Jérémie Fays, 3 june 2015.
Ever considered monitoring your code quality ? SonarQube is certainly a good candidate for that, and an open source one ! This presentation explains shortly the metrics you can track using SonarQube, and how it has been implemented at the University of Liege TTO.
It is not to complicated to keep new project with good code quality for half year. Maybe, for one year. But what if team works on some project for years? Or even ”better”: you need to support and grow large project after another team. Presentation describes Continuous Inspection, main measures of code quality that will make your life better, continuous inspection and how to cook it with SonarQube.
Postman Webinar: “Continuous Testing with Postman”Postman
In this webinar, Postman Developer Advocate Joyce Lin and Engineering Manager Trent McCann discuss automating your tests with Postman while walking you through some advanced testing workflows. Topics include:
- Run tests locally using Postman’s Collection Runner
- Automate testing as part of your continuous integration (CI) pipeline using Postman’s Newman (a command-line collection runner for Postman)
- Run health and security checks using Postman monitors
In this webinar, Postman Developer Educator Sue Smith walks you through the basics of the Postman API Platform and what you can do with it. Topics include:
- How to get started with Postman
- Key tips and tricks every student should know, including how to build basic request configurations and which Postman features are most likely to be useful to you
- Best practices for using Postman to further your education
AI and ML Skills for the Testing World TutorialTariq King
Software continues to revolutionize the world, impacting nearly every aspect of our work, family, and personal life. Artificial intelligence (AI) and machine learning (ML) are playing key roles in this revolution through improvements in search results, recommendations, forecasts, and other predictions. AI and ML technologies are being used in platforms for digital assistants, home entertainment, medical diagnosis, customer support, and autonomous vehicles. Testing practitioners are recognizing the potential for advances in AI and ML to be leveraged for automated testing—an area that still requires significant manual effort. Tariq King and Jason Arbon introduce you to the world of AI for software testing. Learn the fundamentals behind autonomous and intelligent agents, ML approaches including Bayesian networks, decision tree learning, neural networks, and reinforcement learning. Discover how to apply these techniques to common testing tasks such as identifying testable features, generating test flows, and detecting erroneous states.
SonarQube - Should I Stay or Should I Go ? Geeks Anonymes
...by Jérémie Fays, 3 june 2015.
Ever considered monitoring your code quality ? SonarQube is certainly a good candidate for that, and an open source one ! This presentation explains shortly the metrics you can track using SonarQube, and how it has been implemented at the University of Liege TTO.
It is not to complicated to keep new project with good code quality for half year. Maybe, for one year. But what if team works on some project for years? Or even ”better”: you need to support and grow large project after another team. Presentation describes Continuous Inspection, main measures of code quality that will make your life better, continuous inspection and how to cook it with SonarQube.
Postman Webinar: “Continuous Testing with Postman”Postman
In this webinar, Postman Developer Advocate Joyce Lin and Engineering Manager Trent McCann discuss automating your tests with Postman while walking you through some advanced testing workflows. Topics include:
- Run tests locally using Postman’s Collection Runner
- Automate testing as part of your continuous integration (CI) pipeline using Postman’s Newman (a command-line collection runner for Postman)
- Run health and security checks using Postman monitors
DevOps concepts, tools, and technologies v1.0Mohamed Taman
DevOps is not a tool or technology; it is an approach or culture that makes things better.
This session describes in detail how DevOps solves different problems of the traditional
application delivery cycle.
It also describes how it can be used to make development and operations teams efficient and effective in order to make time to market faster by improving culture. It also explains key concepts essential for evolving DevOps culture.
In this session, we will cover the following topics:
1- Understanding the DevOps movement
2- The DevOps lifecycle—it's all about “continuous”
3- Continuous integration
4- Configuration management
5- Continuous delivery/continuous deployment
6- Continuous monitoring
7- Continuous feedback
8- Tools and technologies
One of the biggest problems with code reviews is that they often derail developer productivity. Learn about the essentials of code reviews, where they are today, and where they can be using AI/ML technologies. With machine learning technology, code quality can be improved, and developers can focus on invention, rather than remediation.
A one-hour, intermediate-level Postman learning session geared specifically for developers and testers. We’ll walk you through strategies and tactics for debugging more efficiently. Whether you're just exploring new APIs or developing rigorous API workflows, learn how to work smarter while debugging.
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceDataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. At this session, learn how IBM, working with the Apache community, has unlocked the value of their SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing best performance, concurrency and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider and HBase. We share the results of this project which has enabled running all 99 TPC-DS queries at world record breaking 100TB scale factor.
DevOps concepts, tools, and technologies v1.0Mohamed Taman
DevOps is not a tool or technology; it is an approach or culture that makes things better.
This session describes in detail how DevOps solves different problems of the traditional
application delivery cycle.
It also describes how it can be used to make development and operations teams efficient and effective in order to make time to market faster by improving culture. It also explains key concepts essential for evolving DevOps culture.
In this session, we will cover the following topics:
1- Understanding the DevOps movement
2- The DevOps lifecycle—it's all about “continuous”
3- Continuous integration
4- Configuration management
5- Continuous delivery/continuous deployment
6- Continuous monitoring
7- Continuous feedback
8- Tools and technologies
One of the biggest problems with code reviews is that they often derail developer productivity. Learn about the essentials of code reviews, where they are today, and where they can be using AI/ML technologies. With machine learning technology, code quality can be improved, and developers can focus on invention, rather than remediation.
A one-hour, intermediate-level Postman learning session geared specifically for developers and testers. We’ll walk you through strategies and tactics for debugging more efficiently. Whether you're just exploring new APIs or developing rigorous API workflows, learn how to work smarter while debugging.
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceDataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. At this session, learn how IBM, working with the Apache community, has unlocked the value of their SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing best performance, concurrency and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider and HBase. We share the results of this project which has enabled running all 99 TPC-DS queries at world record breaking 100TB scale factor.
Benchmarking Hadoop - Which hadoop sql engine leads the herdGord Sissons
Stewart Tate (tates@us.ibm.com), a key architect behind the industry's first ever Hadoop-DS benchmark at 30TB scale, describes the benchmark and comparative testing between IBM, Cloudera Impala and Hortonworks Hive
Spark working with a Cloud IDE: Notebook/Shiny AppsData Con LA
Abstract:-
The Problem: Energy inefficiency within public/private buildings in the City of New York.
The Goal: Take meter(Sensor) data, solve the inefficiencies through better insights.
The Solution: Visualization and Reporting through the Shiny App to gain knowledge in past, and present usage patterns. In addition to those patterns, compare and gain insights/predictions on energy usage.
Spark's Dataframes and RDD's will be used in concert with panda (library) to clean and model/prepare data for the R Shiny App. The message to convey in this meetup discussion is to show the capabilities of Spark while using DSX and RStudio/Shiny App to create visualization/reporting that will be able to give insights to the end user.
There are a few techniques that we will present in this notebook with both modeling and ML: Linear Regression, K-Means clustering for identifying inefficient buildings, (Statistical) Classification Modeling, followed by a confusion matrix (error matrices).
Bio:-
Thomas Liakos has been an Open Source Systems Engineer for 11 years and he has 8 years of experience in Cloud and hybrid environments. Prior to IBM Thomas was at Gem.co: Sr. Systems Architect. and CrowdStrike: DevOps / Systems Engineer - Cloud Operations. Thomas has expertise in Spark, Python, Systems and Configuration Management, Architecture, Data Warehousing, and Data Engineering.
Using GPUs to Achieve Massive Parallelism in Java 8Dev_Events
Adam Roberts, IBM Spark Team Lead – Runtimes, IBM Cloud
Graphic processing units (GPUs) are not limited to traditional scene rendering tasks. They can play a
huge role in accelerating applications that have a large number of parallelizable tasks.
Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical
computing applications such as big data and analytics workloads, through both explicit GPU
programming and letting the Java JIT compiler transparently off-load work to the GPU.
This presentation covers the principles and considerations for GPU programming from Java and looks at
the software stack and developer tools available. After this talk you will be ready to extract the full
power of GPUs from your own application. We will present a demo showing GPU acceleration and
discuss what is coming in the future.
Session 2546 - Solving Performance Problems in CICS using CICS Performance A...nick_garrod
InterConnect Session 2546, Solving Performance Problems in CICS using CICS Performance Analyzer. CICS performing well is critical to your Enterprise. Knowing how to solve performance problems is just as critical. This session will concentrate on the most common performance problems seen in the Level 2 Service area. Monitoring and Statistic data will be used to show what to look for and how to avoid these common problems. CICS Performance Analyzer will be used as the primary tool to format and present the data needed to solve these performance issues.
This session will describe how CICS TS v5.1 can quickly and simply support the creation of modern Mobile Ready interfaces to existing applications. The session will introduce the key technologies including the use of Liberty technology in CICS TS. We will work through a simple scenario to demonstrate the key points. The session will cover the core supporting technologies include in CICS TS v5.1 as well as the Dynamic Scripting Feature Pack and content included in the CICS TS v5.2 Open Beta.
Enabling a hardware accelerated deep learning data science experience for Apa...DataWorks Summit
Deep learning techniques are finding significant commercial success in a wide variety of industries. Large unstructured data sets such as images, videos, speech and text are great for deep learning, but impose a lot of demands on computing resources. New types of hardware architectures such as GPUs and faster interconnects (e.g. NVLink), RDMA capable networking interface from Mellanox available on OpenPOWER and IBM POWER systems are enabling practical speedups for deep learning. Data Scientists can intuitively incorporate deep learning capabilities on accelerated hardware using open source components such as Jupyter and Zeppelin notebooks, RStudio, Spark, Python, Docker, and Kubernetes with IBM PowerAI. Jupyter and Apache Zeppelin integrate well with Apache Spark and Hadoop using the Apache Livy project. This session will show some deep learning build and deploy steps using Tensorflow and Caffe in Docker containers running in a hardware accelerated private cloud container service. This session will also show system architectures and best practices for deployments on accelerated hardware. INDRAJIT PODDAR, Senior Technical Staff Member, IBM
Stephan Hummel – IT-Tage 2015 – DB2 In-Memory - Eine Technologie nicht nur fü...Informatik Aktuell
Die DB2 In-Memory-Technologie (BLU Acceleration) beschleunigt analytische Abfragen um ein Vielfaches. Dies gilt sowohl für OLAP als auch für Analysen in einer OLTP-Umgebung. Durch die Integration in den DB2-Kernel ab Version 10.5 sind bestehende Datenbanksysteme bereits für die In-Memory Nutzung vorbereitet. Dadurch ist eine flexible und schnelle Umsetzung gewährleistet.
“Liberté, Égalité, Fraternité” (Liberty, Equality, Fraternity), is the slogan of France, coined around the time of the French Revolution. It also seems a pretty appropriate slogan for the mini revolution that is happening right now with CICS and WebSphere. The Liberty profile is a highly composable and dynamic application server runtime environment that is shipped as a part of both WebSphere and CICS. This session will introduce Liberty in CICS, compare the capability with WebSphere (note the ‘equality’ word) and discuss how these new Liberty applications can interact with and support the established fraternity of existing CICS applications that run your core business.
Similar to Ingesting Data at Blazing Speed Using Apache Orc (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information of the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called as Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. This component will need to have strong consistency and provide large throughputs for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component and is a critical component in allowing us to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache Hbase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load Hfiles directly to the backend circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on: