Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... (Spark Summit)
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Fast, stable and scalable true radix sorting with Matt Dowle at useR! Aalborg (Sri Ambati)
This document discusses the history and implementation of radix sorting in R. It begins by describing how Tom Short proposed using sort.list(x, method="radix") in 2009 to speed up sorting in data.table. This led to Matt Dowle implementing a true radix sorting algorithm called forderv() in data.table. Radix sorting provides dramatically faster sorting for large integer vectors compared to the default R sorting methods. The document then provides technical details on how radix sorting works and how it was optimized in data.table to support larger vectors and data types. In the end, it encourages testing the forderv() radix sorting function in data.table.
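data.table's forderv() is implemented in C, but the LSD (least-significant-digit) radix idea the summary describes can be sketched in a few lines. This Python version is illustrative only (function name and byte-sized digits are choices made here, not taken from the deck):

```python
def radix_sort(xs, digit_bits=8):
    """LSD radix sort for non-negative integers.

    Each pass stably buckets values by one digit (here, one byte),
    starting from the least significant digit, so the order produced
    by earlier passes is preserved for equal digits -- the key radix
    invariant that makes the sort correct and stable.
    """
    if not xs:
        return xs
    base = 1 << digit_bits
    shift = 0
    while (max(xs) >> shift) > 0:
        buckets = [[] for _ in range(base)]
        for x in xs:
            buckets[(x >> shift) & (base - 1)].append(x)
        xs = [x for bucket in buckets for x in bucket]
        shift += digit_bits
    return xs
```

Because each pass does O(n) work and the number of passes depends on the key width rather than n, large integer vectors sort in effectively linear time, which is where the dramatic speedup over comparison sorts comes from.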
Time Series Meetup: Virtual Edition | July 2020 (InfluxData)
This document summarizes an online meetup about anomaly detection using Median Absolute Deviation (MAD) in Flux. It includes an agenda with introductions, a talk on MAD, and time for Q&A. The talk explains how MAD works through numerical examples and uses common Flux functions like group(), drop(), median(), map(), and join(). It also provides instructions for writing custom Flux functions and packages and contributing them to the Flux codebase through testing and compilation.
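The talk builds its MAD pipeline from Flux functions like group(), median(), map(), and join(); the underlying math is compact enough to sketch in Python (a hedged illustration, with the usual 0.6745 modified z-score scaling; the function name and threshold default are made up here):

```python
import statistics

def mad_anomalies(values, threshold=3.0):
    """Flag values whose modified z-score exceeds the threshold.

    MAD = median(|x - median(values)|); the 0.6745 factor makes the
    score roughly comparable to standard deviations for normal data,
    while staying robust to the very outliers being detected.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread at all: nothing can be flagged this way
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]
```

Because both the center and the spread are medians, a single wild point cannot inflate the spread estimate the way it inflates a standard deviation, which is why MAD works well for anomaly detection.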
Vasia Kalavri – Training: Gelly School (Flink Forward)
- Gelly is a graph processing library built on Apache Flink that provides APIs for Java and Scala to work with graphs and perform graph algorithms
- It allows seamless integration of graph-based and record-based analysis by mixing the Gelly and Flink DataSet APIs
- Common graph algorithms like connected components, PageRank, and similarity recommendations are included in the library
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
The document discusses Bloom filters, which are compact data structures used to represent sets probabilistically. Bloom filters allow membership queries to determine if an element is in a set, but may return false positives. They provide a more space-efficient alternative to other data structures like hash tables. The key properties of Bloom filters are that they require less memory than other solutions, allow fast membership checking, and never return false negatives, though they can return false positives. Several applications of Bloom filters are also mentioned such as spell checkers, password checking, and caching.
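As a concrete illustration of those properties, a Bloom filter needs only a bit array and k hash functions; this is a minimal sketch (not any particular library's API) showing why lookups can err only toward false positives:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from
    salted SHA-256. May return false positives, never false negatives,
    and cannot delete or enumerate members."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k independent positions by salting the hash input.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Every added item is always reported present, since its bits stay set; an absent item is reported present only if all k of its positions happen to have been set by other items, which is the false-positive case.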
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre... (Yao Yao)
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, perform the following data cleaning, data analysis, predictive analysis, and machine learning tasks:
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees (Lorenzo Alberton)
The first part of a series of talks about modern algorithms and data structures used by NoSQL databases like HBase and Cassandra. An explanation of Bloom Filters and several derivatives, and Merkle Trees.
HyperLogLog in Hive - How to count sheep efficiently? (bzamecnik)
This document discusses using HyperLogLog (HLL) in Hive to efficiently estimate the number of unique elements or cardinality in big datasets. It describes how HLL provides fast approximate counting using probabilistic data structures. It covers implementing HLL as user-defined functions in Hive, comparing different open source implementations, and examples of using HLL to estimate unique visitors per day and in a rolling window.
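The Hive UDAFs in the deck wrap exactly this kind of sketch. A compact, illustrative Python version (register count and correction constants follow the standard HyperLogLog formulation; this is a sketch, not the deck's implementation) also shows why the rolling-window use case works: sketches merge by register-wise max.

```python
import hashlib
import math

def hll_registers(items, p=12):
    """Build 2**p HyperLogLog registers from a stream of items."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        x = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                      # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def hll_estimate(regs):
    """Standard HLL estimator with the small-range correction."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * m and zeros:                 # linear-counting correction
        est = m * math.log(m / zeros)
    return int(est)

def hll_merge(a, b):
    """Union of two sketches, e.g. two days of unique visitors."""
    return [max(x, y) for x, y in zip(a, b)]
```

Merging by max is lossless for the union, so per-day sketches can be stored cheaply and combined later for any window, which is the "unique visitors in a rolling window" pattern.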
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... (Data Con LA)
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore cutting-edge algorithms for practical uses. HyperLogLog is one of the algorithms that we use intensively in our internal system. It has a really low computation cost and can easily plug into map-reduce frameworks (Hadoop or Spark). Some of the applications worth highlighting are:
* high cardinality testing
* distinct counts of unique users over time
* visualizing HyperLogLog for fraud detection"
Probabilistic data structures. Part 2. Cardinality (Andrii Gakhov)
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I described common data structures and algorithms to estimate the number of distinct elements in a set (cardinality), such as Linear Counting, HyperLogLog, and HyperLogLog++. Each approach comes with some math that is behind it and simple examples to clarify the theory statements.
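Of the three approaches named, Linear Counting is the simplest to show end to end. This hedged Python sketch hashes each element into an m-position bitmap and estimates cardinality from the zero fraction, using n ≈ m·ln(m/zeros):

```python
import hashlib
import math

def linear_count(items, m=4096):
    """Linear Counting: estimate the number of distinct elements from
    the share of bitmap positions left unset after hashing every item."""
    bitmap = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bitmap[h % m] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap saturated; use a larger m")
    return int(m * math.log(m / zeros))
```

Duplicates hash to the same position and set an already-set bit, so only distinct elements move the estimate; the trade-off versus HyperLogLog is that the bitmap must be sized proportionally to the cardinality being measured.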
This document discusses different frameworks for big data processing at ResearchGate, including Hive, MapReduce, and Flink. It provides an example of using Hive to find the top 5 coauthors for each author based on publication data. Code snippets in Hive SQL and Java are included to implement the top k coauthors user defined aggregate function (UDAF) in Hive. The document evaluates different frameworks based on criteria like features, performance, and usability.
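The deck's version is a Hive UDAF, but the aggregation logic itself (co-occurrence counts per author, then a top-k selection) is easy to state in plain Python; all names below are illustrative, not taken from the deck's code:

```python
from collections import defaultdict
from heapq import nlargest

def top_coauthors(publications, k=5):
    """publications: iterable of author lists, one per paper.
    Returns each author's top-k coauthors by shared-paper count."""
    counts = defaultdict(lambda: defaultdict(int))
    for authors in publications:
        for a in authors:
            for b in authors:
                if a != b:
                    counts[a][b] += 1   # one shared paper for this pair
    return {
        author: [b for _, b in nlargest(k, ((n, b) for b, n in co.items()))]
        for author, co in counts.items()
    }
```

A heap-based top-k keeps the per-author selection at O(c log k) for c coauthors, the same reason a UDAF can hold a small bounded state instead of sorting every coauthor list in full.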
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi (InfluxData)
When a large group of people change their habits, it can be tricky for infrastructure! Working from home and spending time indoors today means attending video calls and streaming movies and TV shows. This leads to increased internet traffic that can create congestion on the network infrastructure. So how do you get real-time visibility into your ISP connection? In this meetup, Mirko presents his setup, based on a time series database and a Raspberry Pi, to better understand his ISP connection quality and speed, including upload and download speeds. Join us to discover how he does it using Telegraf, InfluxDB Cloud, Astro Pi, Telegram and Grafana! Finally, proof that your ISP connection is (or is not) as fast as it promises.
In this InfluxDays NYC 2019 talk, InfluxData Founder & CTO Paul Dix will outline his vision around the platform and its new data scripting and query language Flux, and he will give the latest updates on InfluxDB time series database. This talk will walk through the vision and architecture with demonstrations of working prototypes of the projects.
Spark schema for free with David Szakallas (Databricks)
DataFrames are essential for high-performance code, but sadly lag behind in development experience in Scala. When we started migrating our existing Spark application from RDDs to DataFrames at Whitepages, we had to scratch our heads real hard to come up with a good solution. DataFrames come at the cost of compile-time type safety, and there is limited support for encoding JVM types.
We wanted more descriptive types without the overhead of Dataset operations. The data binding API should be extendable. Schema for input files should be generated from classes when we don’t want inference. UDFs should be more type-safe. Spark does not provide these natively, but with the help of shapeless and type-level programming we found a solution to nearly all of our wishes. We migrated the RDD code without any of the following: changing our domain entities, writing schema description or breaking binary compatibility with our existing formats. Instead we derived schema, data binding and UDFs, and tried to sacrifice the least amount of type safety while still enjoying the performance of DataFrames.
Collections in .net technology (2160711) (Janki Shah)
Collections in .NET Framework.
- What are collections?
- Need for collections / importance of collections
- The most useful collection classes, such as
ArrayList, Hashtable, Stack, Queue, BitArray, SortedList
This document provides concise summaries of key Python profiling tools: cProfile and line_profiler profile execution time and identify slow lines of code; memory_profiler profiles memory usage with line-by-line or time-based outputs; YEP extends profiling to compiled C/C++ extensions like Cython modules, which are not covered by the standard Python profilers.
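A minimal cProfile session looks like this (the profiled function is a made-up example; cProfile and pstats ship with the standard library):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

line_profiler and memory_profiler, by contrast, are third-party packages installed via pip and are typically driven by decorating target functions with @profile and running them under their own runners.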
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.
While this feature is certainly useful, it can be quite cumbersome to manipulate data inside complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, it is time-consuming, non-performant, and non-trivial. During this talk we will discuss some commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows users to manipulate complex data such as arrays.
Spark 4th Meetup London - Building a Product with Spark (samthemonad)
This document discusses common technical problems encountered when building products with Spark and provides solutions. It covers Spark exceptions like out of memory errors and shuffle file problems. It recommends increasing partitions and memory configurations. The document also discusses optimizing Spark code using functional programming principles like strong and weak pipelining, and leveraging monoid structures to reduce shuffling. Overall it provides tips to debug issues, optimize performance, and productize Spark applications.
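The monoid point deserves a concrete illustration: a monoid is an associative combine operation with an identity element, which is exactly what lets each partition be pre-aggregated before anything crosses the shuffle. A Python sketch of the idea (not Spark API code; the data is invented):

```python
from collections import Counter
from functools import reduce

# Counter addition is a monoid: associative, with Counter() as identity.
# Because of that, each partition can be reduced to a small summary first,
# and only those summaries need to cross the "shuffle" boundary.
partitions = [
    ["spark", "flink", "spark"],
    ["flink", "hive"],
    ["spark"],
]
partials = [Counter(part) for part in partitions]         # map-side combine
totals = reduce(lambda a, b: a + b, partials, Counter())  # merge summaries
print(totals)
```

Associativity is what makes the two-phase plan legal: the partials can be combined in any grouping and still give the same totals, so the engine is free to aggregate locally first and shuffle far less data.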
This document discusses control structures and break and continue statements in JavaScript. It begins by providing an example of a for loop that counts from 1 to 6000. It then discusses arrays in JavaScript, including how to declare and access single and multi-dimensional arrays. Some key array methods like reverse() and sort() are also mentioned. The document concludes by explaining how to write a web page that prompts the user for 10 words and displays them in sorted order.
Python for R developers and data scientists (Lambda Tree)
This is an introductory talk aimed at data scientists who are well versed with R but would like to work with Python as well. I will cover common workflows in R and how they translate into Python. No Python experience necessary.
This document summarizes how switching from Hadoop to Spark for data science applications improved performance, reliability, and reduced costs at Salesforce. Some key issues addressed were handling large datasets across many S3 prefixes, efficiently computing segment overlap on skewed user data, and performing joins on highly skewed datasets. These changes resulted in applications that were 100x faster, used 10x less data, had fewer failures, and reduced infrastructure costs.
Stratosphere Intro (Java and Scala Interface) (Robert Metzger)
A quick overview of Stratosphere, including our Scala programming interface.
See also bigdataclass.org for two self-paced Stratosphere Big Data exercises.
More information about Stratosphere: stratosphere.eu
Presentation given at the 2013 Clojure Conj on core.matrix, a library that brings multi-dimensional array and matrix programming capabilities to Clojure.
How to Create Database component - Enterprise Application Using C# Lab (priya Nithya)
This document contains C# code for a student database application. It defines a Studb class with methods to get student IDs from a database table to populate a combo box, add a new student record to the database table, and display student records from the table in a data grid view. It also includes code for a Form1 class that uses the Studb class methods to load student IDs, add a new student, and display a student's records on button clicks when connected to a SQL database.
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem (Sages)
Introduction to Hadoop Map Reduce, Pig, Hive and Ambari technologies.
Workshop deck prepared and presented on September 5th, 2015 by Radosław Stankiewicz.
During the day, participants also had the opportunity to work through prepared tutorials and test their analyses on a real cluster.
Real-Time Spark: From Interactive Queries to Streaming (Databricks)
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses:
1. The goals of real-time analytics including having the freshest answers as fast as possible while keeping the answers up to date.
2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs.
3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
The document outlines various statistical and data analysis techniques that can be performed in R, including importing data, data visualization, and correlation and regression, and provides code examples for functions to conduct t-tests, ANOVA, PCA, clustering, and time series analysis, and for producing publication-quality output. It also reviews basic R syntax and functions for computing summary statistics, transforming data, and performing vector and matrix operations.
“Practical Data Science”. The R programming language and Jupyter notebooks are used in this tutorial. However, the concepts are generic and can be applied by Python or other programming language users as well.
The document discusses several common Java anti-patterns, including:
1) Approving a task by rejecting it in a method called "approve".
2) Avoiding the use of helper libraries to simplify tasks like file name parsing.
3) Using reflection when direct method calls would suffice.
Wprowadzenie do technologii Big Data i Apache Hadoop / Intro to Big Data and Apache Hadoop (Sages)
The document introduces concepts related to Big Data technology including volume, variety, and velocity of data. It discusses Hadoop architecture including HDFS, MapReduce, YARN, and the Hadoop ecosystem. Examples are provided of common Big Data problems and how they can be solved using Hadoop frameworks like Pig, Hive, and Ambari.
Apache Flink: API, runtime, and project roadmap (Kostas Tzoumas)
The document provides an overview of Apache Flink, an open source stream processing framework. It discusses Flink's programming model using DataSets and transformations, real-time stream processing capabilities, windowing functions, iterative processing, and visualization tools. It also provides details on Flink's runtime architecture, including its use of pipelined and staged execution, optimizations for iterative algorithms, and how the Flink optimizer selects execution plans.
This document contains instructions for several Java programming exercises involving classes, packages, inheritance, overriding, exceptions, and threads. It outlines code for programs that demonstrate concepts like classes and objects, command line arguments, bitwise operators, method overriding, and packages. For each exercise, it provides the aim, algorithm, sample code, input/output, and result to verify the output. The exercises are intended to help students learn and practice core Java programming concepts.
Analyzing On-Chip Interconnect with Modern C++ (Jeff Trull)
Slides for Silicon Valley Code Camp 2014 presentation. Covers using Modern C++ style libraries and code to analyze interconnect parasitics on ASICs. Matrix and Graph algorithms are described in detail.
This document provides an agenda and overview for a Spark workshop covering Spark basics and streaming. The agenda includes sections on Scala, Spark, Spark SQL, and Spark Streaming. It discusses Scala concepts like vals, vars, defs, classes, objects, and pattern matching. It also covers Spark RDDs, transformations, actions, sources, and the spark-shell. Finally, it briefly introduces Spark concepts like broadcast variables, accumulators, and spark-submit.
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course (Sages)
A quick introduction to the Pig and Hive technologies from the Hadoop ecosystem. Presentation delivered during the Codepot workshops on 29.08.2015 by Radosław Stankiewicz and Bartłomiej Tartanus.
Peter Lawrey is the CEO of Chronicle Software. He has 7 years' experience working as a Java developer for investment banks and trading firms. Chronicle Software helps companies migrate to high-performance Java code and was involved in one of the first large Java 8 projects in production, in December 2014. The company offers workshops, training, consulting, and custom development services. The talk will cover reading and writing lambdas, capturing vs non-capturing lambdas, transforming imperative code to streams, mixing imperative and functional code, and taking Q&A.
This document provides an agenda for an R programming presentation. It includes an introduction to R, commonly used packages and datasets in R, basics of R like data structures and manipulation, looping concepts, data analysis techniques using dplyr and other packages, data visualization using ggplot2, and machine learning algorithms in R. Shortcuts for the R console and IDE are also listed.
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, SummingBird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
After migrating a three-year-old C# project to Java, we ended up with a significant portion of legacy code using lambdas in Java. What were some of the good use cases, which code could have been written better, and what problems did we have migrating from C#? At the end we look at the performance implications of using lambdas.
The document discusses functional programming and lambda expressions in Java 8. It begins by defining functional programming and predicates from predicate logic. It then discusses the key properties of functional programming including no states, passing control, single large function, and no cycles. The document provides examples of determining if a number is prime in both imperative and declarative styles using Java 8 lambda expressions. It also provides examples of getting the first doubled number greater than 3 from a list using both declarative and imperative approaches. The examples demonstrate the use of streams, filters, maps and other functional operations.
The basic concepts of data structures. It covers these topics:
System Life Cycle
Algorithm Specification
Data Abstraction
Performance Analysis
Space Complexity
Time Complexity
Asymptotic Notation
Text Book: Fundamentals of Data Structures in C++
E. Horowitz, et al.
Structured Streaming provides a scalable and fault-tolerant stream processing framework on Spark SQL. It allows users to write streaming jobs using simple batch-like SQL queries that Spark will automatically optimize for efficient streaming execution. This includes handling out-of-order and late data, checkpointing to ensure fault-tolerance, and providing end-to-end exactly-once guarantees. The talk discusses how Structured Streaming represents streaming data as unbounded tables and executes queries incrementally to produce streaming query results.
Story of static code analyzer developmentAndrey Karpov
The document discusses the history and development of static code analyzers. It describes how early tools used regular expressions that were ineffective for complex code analysis. Modern static analyzers overcome these limitations through techniques like type inference, data flow analysis, symbolic execution, and pattern-based analysis. They also leverage method annotations and a mixture of analysis approaches. While machine learning is hyped, static analysis remains very challenging due to the complexity of code and rapid language evolution.
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...GeeksLab Odessa
SubScript is an extension of the Scala language that adds support for the constructs and syntax of the Algebra of Communicating Processes (ACP). SubScript is a promising extension, applicable both to the development of highly loaded parallel systems and to simple personal applications.
Orca: Nocode Graphical Editor for Container OrchestrationPedro J. Molina
Tool demo on CEDI/SISTEDES/JISBD2024 at A Coruña, Spain. 2024.06.18
"Orca: Nocode Graphical Editor for Container Orchestration"
by Pedro J. Molina PhD. from Metadev
Just like life, our code must adapt to the ever changing world we live in. From one day coding for the web, to the next for our tablets or APIs or for running serverless applications. Multi-runtime development is the future of coding, the future is to be dynamic. Let us introduce you to BoxLang.
Boost Your Savings with These Money Management AppsJhone kinadey
A money management app can transform your financial life by tracking expenses, creating budgets, and setting financial goals. These apps offer features like real-time expense tracking, bill reminders, and personalized insights to help you save and manage money effectively. With a user-friendly interface, they simplify financial planning, making it easier to stay on top of your finances and achieve long-term financial stability.
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid
IBM watsonx Code Assistant for Z, our latest Generative AI-assisted mainframe application modernization solution. Mainframe (IBM Z) application modernization is a topic that every mainframe client is addressing to various degrees today, driven largely from digital transformation. With generative AI comes the opportunity to reimagine the mainframe application modernization experience. Infusing generative AI will enable speed and trust, help de-risk, and lower total costs associated with heavy-lifting application modernization initiatives. This document provides an overview of the IBM watsonx Code Assistant for Z which uses the power of generative AI to make it easier for developers to selectively modernize COBOL business services while maintaining mainframe qualities of service.
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
By engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
The Role of DevOps in Digital Transformation.pdfmohitd6
DevOps plays a crucial role in driving digital transformation by fostering a collaborative culture between development and operations teams. This approach enhances the speed and efficiency of software delivery, ensuring quicker deployment of new features and updates. DevOps practices like continuous integration and continuous delivery (CI/CD) streamline workflows, reduce manual errors, and increase the overall reliability of software systems. By leveraging automation and monitoring tools, organizations can improve system stability, enhance customer experiences, and maintain a competitive edge. Ultimately, DevOps is pivotal in enabling businesses to innovate rapidly, respond to market changes, and achieve their digital transformation goals.
Ensuring Efficiency and Speed with Practical Solutions for Clinical OperationsOnePlan Solutions
Clinical operations professionals encounter unique challenges. Balancing regulatory requirements, tight timelines, and the need for cross-functional collaboration can create significant internal pressures. Our upcoming webinar will introduce key strategies and tools to streamline and enhance clinical development processes, helping you overcome these challenges.
Streamlining End-to-End Testing Automation with Azure DevOps Build & Release Pipelines
Automating end-to-end (e2e) tests for Android and iOS native apps, and web apps, within Azure build and release pipelines poses several challenges. This session dives into the key challenges and the repeatable solutions implemented across multiple teams at a leading Indian telecom disruptor, renowned for its affordable 4G/5G services, digital platforms, and broadband connectivity.
Challenge #1. Ensuring Test Environment Consistency: Establishing a standardized test execution environment across hundreds of Azure DevOps agents is crucial for achieving dependable testing results. This uniformity must seamlessly span from Build pipelines to various stages of the Release pipeline.
Challenge #2. Coordinated Test Execution Across Environments: Executing distinct subsets of tests using the same automation framework across diverse environments, such as the build pipeline and specific stages of the Release Pipeline, demands flexible and cohesive approaches.
Challenge #3. Testing on Linux-based Azure DevOps Agents: Conducting tests, particularly for web and native apps, on Azure DevOps Linux agents lacking browser or device connectivity presents specific challenges in attaining thorough testing coverage.
This session delves into how these challenges were addressed through:
1. Automate the setup of essential dependencies to ensure a consistent testing environment.
2. Create standardized templates for executing API tests, API workflow tests, and end-to-end tests in the Build pipeline, streamlining the testing process.
3. Implement task groups in Release pipeline stages to facilitate the execution of tests, ensuring consistency and efficiency across deployment phases.
4. Deploy browsers within Docker containers for web application testing, enhancing portability and scalability of testing environments.
5. Leverage diverse device farms dedicated to Android, iOS, and browser testing to cover a wide range of platforms and devices.
6. Integrate AI technology, such as Applitools Visual AI and Ultrafast Grid, to automate test execution and validation, improving accuracy and efficiency.
7. Utilize AI/ML-powered central test automation reporting server through platforms like reportportal.io, providing consolidated and real-time insights into test performance and issues.
These solutions not only facilitate comprehensive testing across platforms but also promote the principles of shift-left testing, enabling early feedback, implementing quality gates, and ensuring repeatability. By adopting these techniques, teams can effectively automate and execute tests, accelerating software delivery while upholding high-quality standards across Android, iOS, and web applications.
14 th Edition of International conference on computer visionShulagnaSarkar2
About the event
14th Edition of International conference on computer vision
Computer conferences organized by the ScienceFather group. ScienceFather takes the privilege to invite speakers, participants, students, delegates and exhibitors from across the globe to its International Conference on Computer Vision, to be held in various beautiful cities of the world. The conferences are a discussion of common invention-related issues, and additionally trade information and share proof, thoughts and insight into advanced developments in science and invention service systems. New technology may create many materials and devices with a vast range of applications, such as in science, medicine, electronics, biomaterials, energy production and consumer products.
Nominations are open! Don't miss it.
Visit: computer.scifat.com
Award Nomination: https://x-i.me/ishnom
Conference Submission: https://x-i.me/anicon
For Enquiry: Computer@scifat.com
These are the slides of the presentation given during the Q2 2024 Virtual VictoriaMetrics Meetup. View the recording here: https://www.youtube.com/watch?v=hzlMA_Ae9_4&t=206s
Topics covered:
1. What is VictoriaLogs
Open source database for logs
● Easy to setup and operate - just a single executable with sane default configs
● Works great with both structured and plaintext logs
● Uses up to 30x less RAM and up to 15x disk space than Elasticsearch
● Provides simple yet powerful query language for logs - LogsQL
2. Improved querying HTTP API
3. Data ingestion via Syslog protocol
* Automatic parsing of Syslog fields
* Supported transports:
○ UDP
○ TCP
○ TCP+TLS
* Gzip and deflate compression support
* Ability to configure distinct TCP and UDP ports with distinct settings
* Automatic log streams with (hostname, app_name, app_id) fields
4. LogsQL improvements
● Filtering shorthands
● week_range and day_range filters
● Limiters
● Log analytics
● Data extraction and transformation
● Additional filtering
● Sorting
5. VictoriaLogs Roadmap
● Accept logs via OpenTelemetry protocol
● VMUI improvements based on HTTP querying API
● Improve Grafana plugin for VictoriaLogs - https://github.com/VictoriaMetrics/victorialogs-datasource
● Cluster version
○ Try single-node VictoriaLogs - it can replace a 30-node Elasticsearch cluster in production
● Transparent historical data migration to object storage
○ Try single-node VictoriaLogs with persistent volumes - it compresses 1TB of production logs from Kubernetes to 20GB
● See https://docs.victoriametrics.com/victorialogs/roadmap/
Try it out: https://victoriametrics.com/products/victorialogs/
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio, Inc.
Alluxio Webinar
June. 18, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jianjian Xie (Staff Software Engineer, Alluxio)
As Trino users increasingly rely on cloud object storage for retrieving data, speed and cloud cost have become major challenges. The separation of compute and storage creates latency challenges when querying datasets; scanning data between storage and compute tiers becomes I/O bound. On the other hand, cloud API costs related to GET/LIST operations and cross-region data transfer add up quickly.
The newly introduced Trino file system cache by Alluxio aims to overcome the above challenges. In this session, Jianjian will dive into Trino data caching strategies, the latest test results, and discuss the multi-level caching architecture. This architecture makes Trino 10x faster for data lakes of any scale, from GB to EB.
What you will learn:
- Challenges relating to the speed and costs of running Trino in the cloud
- The new Trino file system cache feature overview, including the latest development status and test results
- A multi-level cache framework for maximized speed, including Trino file system cache and Alluxio distributed cache
- Real-world cases, including a large online payment firm and a top ridesharing company
- The future roadmap of Trino file system cache and Trino-Alluxio integration
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...Ortus Solutions, Corp
Join us for a session exploring CommandBox 6’s smooth website transition and efficient deployment. CommandBox revolutionizes web development, simplifying tasks across Linux, Windows, and Mac platforms. Gain insights and practical tips to enhance your development workflow.
Come join us for an enlightening session where we delve into the smooth transition of current websites and the efficient deployment of new ones using CommandBox 6. CommandBox has revolutionized web development, consistently introducing user-friendly enhancements that catalyze progress in the field. During this presentation, we’ll explore CommandBox’s rich history and showcase its unmatched capabilities within the realm of ColdFusion, covering both major variations.
The journey of CommandBox has been one of continuous innovation, constantly pushing boundaries to simplify and optimize development processes. Regardless of whether you’re working on Linux, Windows, or Mac platforms, CommandBox empowers developers to streamline tasks with unparalleled ease.
In our session, we’ll illustrate the simple process of transitioning existing websites to CommandBox 6, highlighting its intuitive features and seamless integration. Moreover, we’ll unveil the potential for effortlessly deploying multiple websites, demonstrating CommandBox’s versatility and adaptability.
Join us on this journey through the evolution of web development, guided by the transformative power of CommandBox 6. Gain invaluable insights, practical tips, and firsthand experiences that will enhance your development workflow and embolden your projects.
A neural network is a machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work together to identify phenomena, weigh options and arrive at conclusions.
Building API data products on top of your real-time data infrastructureconfluent
This talk and live demonstration will examine how Confluent and Gravitee.io integrate to unlock value from streaming data through API products.
You will learn how data owners and API providers can document and secure data products on top of Confluent brokers, including schema validation, topic routing and message filtering.
You will also see how data and API consumers can discover and subscribe to products in a developer portal, as well as how they can integrate with Confluent topics through protocols like REST, Websockets, Server-sent Events and Webhooks.
Whether you want to monetize your real-time data, enable new integrations with partners, or provide self-service access to topics through various protocols, this webinar is for you!
2. Generics: Have Typed Containers
Before (untyped container):
List list = new ArrayList();
list.add("one");
list.add(1);
list.add(1L);
list.add(new Object());
After (typed container):
List<String> list = new ArrayList<>();
list.add("one");
list.add("1");
list.add("1212");
3. A Thing we cannot do with containers
Transform a container of Strings into a container of Integers
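Streams close exactly this gap: a container of Strings can be mapped, element by element, into a container of Integers. A minimal sketch (class and method names here are illustrative, not from the slides):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ContainerTransform {
    // Turn a container of Strings into a container of Integers via map
    static List<Integer> toInts(List<String> strings) {
        return strings.stream()
                .map(Integer::parseInt) // String -> Integer, element by element
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toInts(Arrays.asList("1", "2", "3"))); // [1, 2, 3]
    }
}
```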
4. Monad: {Type Constructor, Bind, Return}

                   Optional    Stream
Type Constructor   of          of
Bind               map         map
Return             get         collect
Example:
Stream.of(1,2,3,4,5).map(i -> i + 2).collect(Collectors.toList())
Optional.of("Test").map(s -> s.concat(" ").concat(" One")).get()
https://en.wikipedia.org/wiki/Monad_(functional_programming)
5. Use Case of Optional: Utility method to get
encrypted password in base 64
public String encodePassword1(String password) throws Exception {
return Optional.ofNullable(password)
.map(String::getBytes)
.map(MessageDigest.getInstance("SHA1")::digest)
.map(Base64.getEncoder()::encodeToString)
.orElse(null);
//.orElseThrow(IllegalArgumentException::new);
}
public String encodePassword2(String password) throws Exception {
    if (password == null)
        return password;
    return Base64.getEncoder().encodeToString(
            MessageDigest.getInstance("SHA1").digest(password.getBytes()));
}
6. Exercise
• Write a method that accepts a String (which can be null) and returns the
number of characters in the String. If the String is null, throw
IllegalArgumentException.
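One possible solution in the Optional style of the previous slide (the method name is illustrative):

```java
import java.util.Optional;

public class StringLength {
    // Return the character count; a null input triggers IllegalArgumentException
    static int countChars(String s) {
        return Optional.ofNullable(s)
                .map(String::length)
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        System.out.println(countChars("hello")); // 5
    }
}
```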
7. Use Cases of Stream: Intro
There are many Stream type constructors:
• Directly from Stream class
Stream.of(T…t)
• From Collections
List<String> list = new ArrayList<>();
list.stream();
• Primitive Streams
• IntStream
• IntStream.iterate
IntStream.iterate(1, i -> i + 1).limit(100) [ equivalent to for(int i = 0; i < 100; i++) ]
• Random IntStream
ThreadLocalRandom.current().ints()
• LongStream (similar to IntStream)
• Buffered Reader
• new BufferedReader(new FileReader("input.txt")).lines();
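A few of these constructors can be exercised together in one short sketch (file-based and random sources omitted; the class name is illustrative). Note that the iteration lambda must be `i -> i + 1`, not `i -> i++`, since a post-increment on the lambda parameter returns the old value and discards the increment:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class StreamSources {
    // Equivalent of for (int i = 0; i < 100; i++), expressed as a primitive stream
    static int sumOneToHundred() {
        return IntStream.iterate(1, i -> i + 1).limit(100).sum();
    }

    public static void main(String[] args) {
        // Directly from values
        List<String> letters = Stream.of("a", "b", "c").collect(Collectors.toList());
        // From a collection
        long n = Arrays.asList(1, 2, 3).stream().count();
        System.out.println(letters + " " + n + " " + sumOneToHundred()); // [a, b, c] 3 5050
    }
}
```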
9. Side Note: Using Parallel Stream
StopWatch stopWatch = new StopWatch();
stopWatch.start();
long sampleSize = 1_000_000_000L;
// Monte Carlo estimate of pi: the fraction of random points (x, y) in the
// unit square with sqrt(x^2 + y^2) < 1 approaches pi/4
long count = ThreadLocalRandom.current().doubles(0, 1)
.limit(sampleSize)
.parallel()
.map(d -> Math.pow(d, 2D)) // x^2
.map(d -> d + Math.pow(ThreadLocalRandom.current().nextDouble(0D, 1D), 2D)) // + y^2
.map(Math::sqrt)
.filter(d -> d < 1D) // inside the unit circle
.count();
stopWatch.stop();
System.out.printf("Original: %s Computed: %s , Time: %s %n",
Math.PI,
4D * count / sampleSize,
stopWatch);
10. Side Note: Use of Cores
With Parallel Streams:
Original: 3.141592653589793 Computed: 3.141606068 , Time: 0:00:04.342
With Serial Streams:
Original: 3.141592653589793 Computed: 3.141551928 , Time: 0:00:38.072
[Charts: baseline vs. per-process CPU usage]
11. Exercise
• Read File with one billion numbers and find the Average
https://tinyurl.com/y8k8dfkf
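A sketch of one way to solve this with streams: stream the file lazily so the billion numbers never have to fit in memory at once (class and method names are illustrative; the demo file in main stands in for the linked data set):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.OptionalDouble;
import java.util.stream.Stream;

public class FileAverage {
    // Files.lines streams the file lazily, one line at a time
    static OptionalDouble average(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.mapToDouble(Double::parseDouble).average();
        }
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("numbers", ".txt");
        Files.write(demo, Arrays.asList("1", "2", "3", "4"));
        System.out.println(average(demo)); // OptionalDouble[2.5]
    }
}
```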
12. Monad As A Design Pattern
Structural Pattern
• Operation Chaining
• Apply functions regardless of the result of any of them
https://github.com/iluwatar/java-design-patterns/tree/master/monad
13. Try Monad
Optional for NullPointerException, Try For RuntimeException
import java.util.function.Function;
public class Try<I> {
private I i;
private Try(I i) {
this.i = i;
}
// Type Constructor
public static <O> Try<O> of(O instance) {
return new Try<O>(instance);
}
// Bind
public <O> Try<O> attempt(Function<I, O> function) {
try {
O o = function.apply(i);
return new Try<O>(o);
} catch (Throwable t) {
return new Try<O>(null);
}
}
// Return
public I get() {
return i;
}
// Return
public I getOrElse(I defaultValue) {
return i == null ? defaultValue : i;
}
}
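Chaining with this Try monad might then look as follows; a minimal runnable sketch that condenses the Try class from the slide into a nested class so the demo is self-contained:

```java
import java.util.function.Function;

public class TryDemo {
    // Condensed copy of the Try monad from the slide, for a runnable demo
    static class Try<I> {
        private final I i;
        private Try(I i) { this.i = i; }
        static <O> Try<O> of(O instance) { return new Try<>(instance); }
        <O> Try<O> attempt(Function<I, O> f) {
            try { return new Try<>(f.apply(i)); }
            catch (Throwable t) { return new Try<>(null); } // swallow, carry null
        }
        I getOrElse(I defaultValue) { return i == null ? defaultValue : i; }
    }

    public static void main(String[] args) {
        // Happy path: every step succeeds
        int ok = Try.of("42").attempt(Integer::parseInt).attempt(i -> i * 2).getOrElse(-1);
        // Failure path: parseInt throws, later steps pass the null through
        int bad = Try.of("oops").attempt(Integer::parseInt).attempt(i -> i * 2).getOrElse(-1);
        System.out.println(ok + " " + bad); // 84 -1
    }
}
```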
14. Exercise
• Write a monad which uses a provided default value if the value is null when
chaining.
String lname = DefaultM.of(user)
.map(User::getName, "Roy")
.map(s -> lastNameMap.get(s), "James")
.get();
• Write a monad which uses a provided default value if the value does not
satisfy a provided criterion when chaining.
String lname = DefaultM.of(user)
.map(User::getName, s -> s.length() < 1, "Roy")
.map(s -> lastNameMap.get(s), s -> s.contains("R"), "James")
.get();
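One way the first variant could be sketched (class and method names are illustrative, and this is only one reading of the exercise):

```java
import java.util.function.Function;

public class DefaultMDemo {
    // A monad that substitutes a default whenever a mapping step yields null
    static class DefaultM<T> {
        private final T value;
        private DefaultM(T value) { this.value = value; }
        static <T> DefaultM<T> of(T value) { return new DefaultM<>(value); }
        <R> DefaultM<R> map(Function<T, R> f, R defaultValue) {
            R r = (value == null) ? null : f.apply(value);
            return new DefaultM<>(r == null ? defaultValue : r);
        }
        T get() { return value; }
    }

    public static void main(String[] args) {
        // Null input: the default survives the chain
        String name = DefaultM.of((String) null).map(s -> s + "!", "Roy").get();
        System.out.println(name); // Roy
    }
}
```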
15. Things to study more
• Monad functions: zip, map, flatmap, sequence
• Monad Design Pattern: https://github.com/iluwatar/java-design-patterns/tree/master/monad
• Monoids: https://en.wikipedia.org/wiki/Monoid
• Use of Optional: https://www.programcreek.com/java-api-examples/?api=java.util.Optional
• Functor and Monad Examples in Plain Java: https://dzone.com/articles/functor-and-monad-examples-in-plain-java