Have you ever written a long project for a simple column rename and thought, this should be easier? What about nicely named output statements? Yeah they bother me too. Oh, and DEDUP(SORT(DISTINCT()))? There is a better way! Learn how Dapper can help!
When transitioning to functional programming as an already experienced developer in the imperative arts, one important skill fundamental to my technical maturity was thinking in terms of the properties of the systems I was building.
From modeling application domain constraints to testing distributed systems at scale in production, I found that thinking in properties can help you and your team build more sustainable systems.
Property-based testing provides a launchpad to discover and practice this mental model in your software development activities.
This session is for developers starting to exploit property-based testing from beginner to intermediate level and will:
- quickly review property-based testing
- identify common pitfalls with property-based testing alone
- suggest how to combine with other techniques and approaches to avoid their pitfalls
- illustrate think in properties so you can employ property-based “tests” at all phases of development
Limited exposure to the idea of property-based testing is desirable but not required. Code examples will be in Haskell.
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Thoughtworks
3 years ago, Springer decided to use Scala on a large, strategic project. This talk is about the journey the development teams made. Why did they choose Scala in the first place? Did they get what they hoped for? What challenges and surprises did they encounter along the way? And, most importantly, are they still happy with their choice?
Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.
Hadoop isn't limited to running Java code, you can write your jobs in a variety of dynamic languages.
This talk is about Hadoop's Streaming API, and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.
WebAssembly. Neither Web Nor Assembly, All RevolutionaryC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2tWqrMm.
Jay Phelps talks about WebAssembly, a bytecode designed and maintained by some of the major players in tech: Google, Microsoft, Apple, Mozilla, Intel, LG, and many others. He talks about what WebAssembly is and what it isn’t. Filmed at qconsf.com.
Jay Phelps is the Chief Software Architect and Co-founder at This Dot, where they provide support, training, mentoring, and software design. Previously, he worked as a Senior Software Engineer at Netflix.
When transitioning to functional programming as an already experienced developer in the imperative arts, one important skill fundamental to my technical maturity was thinking in terms of the properties of the systems I was building.
From modeling application domain constraints to testing distributed systems at scale in production, I found that thinking in properties can help you and your team build more sustainable systems.
Property-based testing provides a launchpad to discover and practice this mental model in your software development activities.
This session is for developers starting to exploit property-based testing from beginner to intermediate level and will:
- quickly review property-based testing
- identify common pitfalls with property-based testing alone
- suggest how to combine with other techniques and approaches to avoid their pitfalls
- illustrate think in properties so you can employ property-based “tests” at all phases of development
Limited exposure to the idea of property-based testing is desirable but not required. Code examples will be in Haskell.
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Thoughtworks
3 years ago, Springer decided to use Scala on a large, strategic project. This talk is about the journey the development teams made. Why did they choose Scala in the first place? Did they get what they hoped for? What challenges and surprises did they encounter along the way? And, most importantly, are they still happy with their choice?
Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.
Hadoop isn't limited to running Java code, you can write your jobs in a variety of dynamic languages.
This talk is about Hadoop's Streaming API, and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.
WebAssembly. Neither Web Nor Assembly, All RevolutionaryC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2tWqrMm.
Jay Phelps talks about WebAssembly, a bytecode designed and maintained by some of the major players in tech: Google, Microsoft, Apple, Mozilla, Intel, LG, and many others. He talks about what WebAssembly is and what it isn’t. Filmed at qconsf.com.
Jay Phelps is the Chief Software Architect and Co-founder at This Dot, where they provide support, training, mentoring, and software design. Previously, he worked as a Senior Software Engineer at Netflix.
Introduction to Terraform - presented at the Perth Python & Django meetup on March 1 2018. Demo code repo can be found here: https://github.com/jaymickey/terraform-demo
Feature Talk: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Talk Abstract: Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, Spark Streaming, SQL, ML, GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Talk Agenda:
• Intro
• Live, Interactive Recommendations Demo
• Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird (advancedspark.com)
• Types of Similarity
• Euclidean vs. Non-Euclidean Similarity
• Jaccard Similarity
• Cosine Similarity
• LogLikelihood Similarity
• Edit Distance
• Text-based Similarities and Analytics
• Word2Vec
• LDA Topic Extraction
• TextRank
• Similarity-based Recommendations
• User-to-User
• Content-based, Item-to-Item (Amazon)
• Collaborative-based, User-to-Item (Netflix)
• Graph-based, Item-to-Item "Pathways" (Spotify)
• Aggregations, Approximations, and Similarities at Scale
• Twitter Algebird
• MinHash and Bucketing
• Locality Sensitive Hashing (LSH)
• BloomFilters
• CountMin Sketch
• HyperLogLog
• Q & A
Speaker Bio: Chris Fregly is a Research Engineer @ Flux Capacitor AI in SF, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
A comprehensive walkthrough of how to manage infrastructure-as-code using Terraform. This presentation includes an introduction to Terraform, a discussion of how to manage Terraform state, how to use Terraform modules, an overview of best practices (e.g. isolation, versioning, loops, if-statements), and a list of gotchas to look out for.
For a written and more in-depth version of this presentation, check out the "Comprehensive Guide to Terraform" blog post series: https://blog.gruntwork.io/a-comprehensive-guide-to-terraform-b3d32832baca
Terraform, is no doubt very flexible and powerful. The question is, how do we write Terraform code and construct our infrastructure in a reproducible fashion that makes sense? How can we keep code DRY, segment state, and reduce the risk of making changes to our service/stack/infrastructure?
HashiCorp’s infrastructure management tool, Terraform, is no doubt very flexible and powerful. The question is, how do we write Terraform code and construct our infrastructure in a reproducible fashion that makes sense? How can we keep code DRY, segment state, and reduce the risk of making changes to our service/stack/infrastructure?
This talk describes a design pattern to help answer the previous questions. The talk is divided into two sections, with the first section describing and defining the design pattern with a Deployment Example. The second part uses a multi-repository GitHub organization to create a Real World Example of the design pattern.
Why is My Spark Job Failing? by Sandy Ryza of ClouderaData Con LA
Abstract:
You are not a bad person. But your Apache Spark job is failing. It is running out of memory. It is stalled. It is complaining that no executors have registered or spitting out "Filesystem closed" exceptions with lines upon lines of $anon$1's or being consumed by a swarm of locusts the likes of which have not been seen since Moses crossed the Red Sea. Or it's completing -- 20 times as slow as it should reasonably take. Why? In this talk, you'll learn the internals of Spark jobs, the root causes of such ailments, and tuning strategies for avoiding them.
Bio:
Sandy Ryza is a data scientist at Cloudera, an Apache Hadoop committer, and a Spark contributor. Sandy is also the co-author of Advanced Analytics with Spark.
Slides form Config Management Camp, looking at how you can take a collaborative GitFlow approach to Terraform using Remote State, Modules and Dynamically Generated Credentials using Vault
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
This is a work in progress of a talk for the Scala User Group in Tokyo.
It touches on basics and some ideas behind Reactive Streams as well as the implementation shipped by Akka.
In this talk I go over how Bankrate leveraged Terraform to go from a company where infrastructure was provisioned via tickets to a team to a company where developers fully own their own infrastructure. No tickets, no paging for help, no bottlenecks - just education and enablement. Interested in hearing the talk? Please reach out!
Hiveminder - Everything but the Secret SauceJesse Vincent
Ten tools and techniques to help you:
Find bugs faster バグの検出をもっと素早く
Build web apps ウェブアプリの構築
Ship software ソフトのリリース
Get input from users ユーザからの入力を受けつける
Own the Inbox 受信箱を用意する
今日の話
Introduction to Terraform - presented at the Perth Python & Django meetup on March 1 2018. Demo code repo can be found here: https://github.com/jaymickey/terraform-demo
Feature Talk: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Talk Abstract: Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, Spark Streaming, SQL, ML, GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Talk Agenda:
• Intro
• Live, Interactive Recommendations Demo
• Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird (advancedspark.com)
• Types of Similarity
• Euclidean vs. Non-Euclidean Similarity
• Jaccard Similarity
• Cosine Similarity
• LogLikelihood Similarity
• Edit Distance
• Text-based Similarities and Analytics
• Word2Vec
• LDA Topic Extraction
• TextRank
• Similarity-based Recommendations
• User-to-User
• Content-based, Item-to-Item (Amazon)
• Collaborative-based, User-to-Item (Netflix)
• Graph-based, Item-to-Item "Pathways" (Spotify)
• Aggregations, Approximations, and Similarities at Scale
• Twitter Algebird
• MinHash and Bucketing
• Locality Sensitive Hashing (LSH)
• BloomFilters
• CountMin Sketch
• HyperLogLog
• Q & A
Speaker Bio: Chris Fregly is a Research Engineer @ Flux Capacitor AI in SF, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
A comprehensive walkthrough of how to manage infrastructure-as-code using Terraform. This presentation includes an introduction to Terraform, a discussion of how to manage Terraform state, how to use Terraform modules, an overview of best practices (e.g. isolation, versioning, loops, if-statements), and a list of gotchas to look out for.
For a written and more in-depth version of this presentation, check out the "Comprehensive Guide to Terraform" blog post series: https://blog.gruntwork.io/a-comprehensive-guide-to-terraform-b3d32832baca
Terraform, is no doubt very flexible and powerful. The question is, how do we write Terraform code and construct our infrastructure in a reproducible fashion that makes sense? How can we keep code DRY, segment state, and reduce the risk of making changes to our service/stack/infrastructure?
HashiCorp’s infrastructure management tool, Terraform, is no doubt very flexible and powerful. The question is, how do we write Terraform code and construct our infrastructure in a reproducible fashion that makes sense? How can we keep code DRY, segment state, and reduce the risk of making changes to our service/stack/infrastructure?
This talk describes a design pattern to help answer the previous questions. The talk is divided into two sections, with the first section describing and defining the design pattern with a Deployment Example. The second part uses a multi-repository GitHub organization to create a Real World Example of the design pattern.
Why is My Spark Job Failing? by Sandy Ryza of ClouderaData Con LA
Abstract:
You are not a bad person. But your Apache Spark job is failing. It is running out of memory. It is stalled. It is complaining that no executors have registered or spitting out "Filesystem closed" exceptions with lines upon lines of $anon$1's or being consumed by a swarm of locusts the likes of which have not been seen since Moses crossed the Red Sea. Or it's completing -- 20 times as slow as it should reasonably take. Why? In this talk, you'll learn the internals of Spark jobs, the root causes of such ailments, and tuning strategies for avoiding them.
Bio:
Sandy Ryza is a data scientist at Cloudera, an Apache Hadoop committer, and a Spark contributor. Sandy is also the co-author of Advanced Analytics with Spark.
Slides form Config Management Camp, looking at how you can take a collaborative GitFlow approach to Terraform using Remote State, Modules and Dynamically Generated Credentials using Vault
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
This is a work in progress of a talk for the Scala User Group in Tokyo.
It touches on basics and some ideas behind Reactive Streams as well as the implementation shipped by Akka.
In this talk I go over how Bankrate leveraged Terraform to go from a company where infrastructure was provisioned via tickets to a team to a company where developers fully own their own infrastructure. No tickets, no paging for help, no bottlenecks - just education and enablement. Interested in hearing the talk? Please reach out!
Hiveminder - Everything but the Secret SauceJesse Vincent
Ten tools and techniques to help you:
Find bugs faster バグの検出をもっと素早く
Build web apps ウェブアプリの構築
Ship software ソフトのリリース
Get input from users ユーザからの入力を受けつける
Own the Inbox 受信箱を用意する
今日の話
A whirlwind tour of the modules that any perl hacker, from beginner to experienced, should use and why.
Handout: List of modules in the talk along with many more: https://sites.google.com/site/perlhercynium/TEPHT-List2.pdf?attredirects=0
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
While systems like Apache Spark have moved beyond a simple map-reduce model, many data scientists and scientific users still struggle with complex cluster management and configuration tools when trying to do data processing in the cloud. Recently, cloud providers have offered infrastructure such as AWS Lambda to run event-driven, stateless functions as micro-services. In this model, a function is deployed once and is invoked repeatedly whenever new inputs arrive and elastically scales with input size. In this session, the speakers claim that microservices on serverless infrastructure present a viable platform for eliminating cluster management overhead and fulfilling the promise of elasticity in cloud computing for all users. Their key insight is that they can dynamically inject code into these stateless functions and, combined with remote storage, they can build a data processing system that inherits the elasticity of the serverless model while addressing the simplicity required by end users.
Using PyWren, their implementation on AWS Lambda, they show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Learn about a number of scientific and machine learning applications that they have built with PyWren, and how this model could be used to develop a serverless-Spark in the future.
This presentation goes over basic exploitation techniques. Topics include:
- Introduction to x86 paradigms used exploited by these techniques
- Stack overflows including the classic stack smashing attack
- Ret2libc
- Format string exploits
- Heap overflows and metadata corruption attacks
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...Alexander Dean
Hadoop is everywhere these days, but it can seem like a complex, intimidating ecosystem to those who have yet to jump in.
In this hands-on workshop, Alex Dean, co-founder of Snowplow Analytics, will take you "from zero to Hadoop", showing you how to run a variety of simple (but powerful) Hadoop jobs on Elastic MapReduce, Amazon's hosted Hadoop service. Alex will start with a no-nonsense overview of what Hadoop is, explaining its strengths and weaknesses and why it's such a powerful platform for data warehouse practitioners. Then Alex will help get you setup with EMR and Amazon S3, before leading you through a very simple job in Pig, a simple language for writing Hadoop jobs. After this we will move onto writing a more advanced job in Scalding, Twitter's Scala API for writing Hadoop jobs. For our final job, we will consolidate everything we have learnt by building a more sophisticated job in Scalding.
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
Alpine academy apache spark series #1 introduction to cluster computing with python & a wee bit of scala. This is the first in the series and is aimed at the intro level, the next one will cover MLLib & ML.
A revamped version of the Ansible intro talk from February 2015, brought up-to-date for the January Ansible meetup in Berlin.
Join our group: https://www.meetup.com/Ansible-Berlin
You probably think that PL/SQL is dull and ordinary programming language. Not so! Parts of it can be downright WEIRD. In this presentation, Steven offers what he considers to be some of the stranger nooks and crannies of the PL/SQL language, perhaps in the process making them a little bit less weird.
Testing and validating distributed systems with Apache Spark and Apache Beam ...Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems are hard, special considerations for simulating "bad" partioning, figuring out when your stream tests are stopped, and solutions to these challenges.
What we Learned Implementing Puppet at BackstopPuppet
"What We Learned Implementing Puppet at Backstop" by Bill Weiss at Puppet Camp Chicago 2013. Learn about upcoming Puppet Camps at http://puppetlabs.com/community/puppet-camp/
Come hear a brief overview on the direction the HPCC Systems platform is heading, and get a glimpse into some of the likely highlights included in the next minor and major versions.
This talk will explain the reasoning behind the release cycle changes, and how overcoming the challenges faced in the previous practice of automated testing has introduced new benefits and wider acceptance from the wider community.
Advancements in HPCC Systems Machine LearningHPCC Systems
This presentation will provide an overview of the latest advancements in Machine Learning modules over the past year, including Clustering, Natural Language Processing, Deep Learning, and the Expanded Model Evaluation Metrics.
Clustering Methods of the HPCC Systems Machine Learning Library
The clustering method is an important part of unsupervised learning. To gain the unsupervised learning capability, two widely applied clustering methods, KMeans and DBSCAN are adopted to the current Machine Learning library. This presentation will introduce the newly developed clustering algorithms and the evaluation methods.
Learn how to package the HPCC Systems Platform in a Docker container and deploy it locally, and build an HPCC Systems Platform AMI followed by an AWS deployment.
Expanding HPCC Systems Deep Neural Network CapabilitiesHPCC Systems
The training process for modern deep neural networks requires big data and large computational power. Though HPCC Systems excels at both of these, HPCC Systems is limited to utilizing the CPU only. It has been shown that GPU acceleration vastly improves Deep Learning training time. In this talk, Robert will explain how HPCC Systems became the first GPU accelerated library while also greatly expanding its deep neural network capabilities.
Leveraging Intra-Node Parallelization in HPCC SystemsHPCC Systems
HPCC Systems offers parallelization by assuming data-independent tasks. For operations having complex predicates, such as the set similarity join (SSJ) predicates, this assumption might create a tradeoff. On the one hand, one may choose a data replication strategy with large intermediate data groups assuring that all intermediate results fit into main memory. However, this can lead to an underutilization of today's massively parallel CPUs. On the other hand, you may choose a higher degree of replication for better CPU utilization. Such a choice may lead to an overutilization of main memory on the compute nodes. Our research focuses on the parallel implementation of the set similarity join (SSJ) operator. This operator finds all pairs of records which have a similarity above a defined threshold using a similarity measure such as Jaccard. Our goal is a robust approach for executing SSJ that does not over-utilize memory and exploits CPU parallelization as much as possible. This approach requires data sharing between tasks/threads which is not foreseen in HPCC Systems so far. In this talk, we describe how we implemented multithreaded user-defined functions for the SSJ operator. To be able to control NUMAspecific parallelization conditions, we implemented a C++ plugin for HPCC Systems. Furthermore, we show how we visualized the relevant system parameters (CPU and memory usage). This talk is intended for anyone interested in extending HPCC Systems by plugins and monitoring distributed program execution.
DataPatterns - Profiling in ECL Watch HPCC Systems
DataPatterns.Profile() has been evolving since the last time you may have seen it. It does more. It looks better. It has been integrated into the ECL standard library and into ECL Watch. Learn what this data profiler can do for you and how its built-in visualization easily summarizes the results.
Join us for an introductory walk-through of using the Spark-HPCC Systems ecosystem to analyze your HPCC Systems data using a collaborative Apache Zeppelin notebook environment.
The Workunit Analyser examines the entire workunit to produce advice that both novices and experienced ECL developers should find useful. The Workunit Analyser is a post-execution analyser that identifies potential issues and assists users in writing better ECL.
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...HPCC Systems
Join NSU University School student and program leader for girls in robotics, Ronnie Shashoua, as she talks about Gadget Girls - a project in collaboration with the NSU Alvin Sherman Library, NSU University School, the South Florida Girl Scouts, and sponsorship from the HPCC Systems Academic Program. Gadget Girls is a program aimed at encouraging girls in fourth and fifth grade to explore their interests in and love for STEM, especially robotics and engineering. Shashoua will discuss the underrepresentation of girls in the Florida Vex Robotics circuit, such as how it demonstrates a larger trend of low numbers of women undertaking STEM educational and career paths and the role it played in inspiring the creation of Gadget Girls.
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...HPCC Systems
Hear how the Florida Atlantic University Center for Autism and Related Disabilities has partnered with the HPCC Systems community to provide young people with autism both the technology and professional skills needed to compete in today’s workplace. Mentoring and hands-on coding through ECL workshops have positively impacted students, opening doors to new opportunities for both students and employers.
Using High Dimensional Representation of Words (CBOW) to Find Domain Based Co...HPCC Systems
Text cleaning is becoming an essential step in text classification. Stop word removal is a crucial space-saving technique in text cleaning which saves huge amounts of space in text indexing. There are many domain-based common words which differ from one domain to another and have no value within a particular domain. Eliminating these words will reduce the size of the corpus and enhance the performance of text mining. This talk will discuss how the text vectors bundle, (CBOW), in HPCC Systems was used to find the domain based common words, and the methodology applied to enhance the performance of classification methods.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
4. Engineers on big projects may need this level of control. But.
QAs Analysts
Developers
Data
Scientist
5. For these people, ECL syntax is a bit of a trial!
Dedup
• DEDUP(SORT(DISTRIBUTE(x, HASH(y)), x, LOCAL), x, LOCAL);
One column transform
• PROJECT(x, TRANSFORM(RECORDOF(LEFT), SELF.y := LEFT.y+1; SELF := LEFT;);
Named output
• OUTPUT(x, NAMED('x'));
Write to CSV
• OUTPUT(x, , '~ROB::TEMP::x', CSV(HEADING(SINGLE), SEPARATOR(','), TERMINATOR('n'),
QUOTE('"')));
Grouped count
• [I ran out of space]
Dapper – A Bundle to Make Your ECL Neater
6. How does this stuff work in other languages? Well, R is nice!
library(dplyr)
df <- read.csv('x')
df <- select(df, col1, col2)
df <- mutate(df, col3 = col1 +
col2)
df <- group_by(df, col3)
df <- summarise(df, col5 = n())
write.csv(df, file='output.csv')
Dapper – A Bundle to Make Your ECL Neater
7. How does this stuff work in other languages? Well, R is nice!
library(dplyr)
df <-
read.csv('x') %>%
select(col1, col2) %>%
mutate(col3 = col1 + col2)
%>%
group_by(col3) %>%
summarise(col5 = n()) %>%
write.csv(file='output.csv')
Dapper – A Bundle to Make Your ECL Neater
8. SQL is also lovely, but can be hard to arrange into a single call
SELECT COUNT(col2), col1 FROM TABLE GROUP BY
col1;
Dapper – A Bundle to Make Your ECL Neater
9. ….and Python is, as always, Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
…
Dapper – A Bundle to Make Your ECL Neater
13. View Data
//load data
StarWars :=
ExampleData.starwars;
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);
Dapper – A Bundle to Make Your ECL Neater
15. //Fill blank species with unknown
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.', species));
tt.head(fillblankHome);
Fill in some blanks
Dapper – A Bundle to Make Your ECL Neater
16. That’s right, we don’t need LEFT or SELF!!!
What sorcery is this?!?!?
Dapper – A Bundle to Make Your ECL Neater
17. Okay, we now need to make our BMI column!
//make height meters
heightMeters := tt.mutate(fillblankHome, height, height/100);
//Create a BMI for each character
bmi := tt.append(heightMeters, REAL, BMI, mass/(height^2));
//Look at just the new column and name
bmiSelect := tt.select(bmi, 'name, bmi');
tt.head(bmiSelect);
Dapper – A Bundle to Make Your ECL Neater
18. Let's work through an example
Sort!
//Find the highest
sortedBMI := tt.arrange(bmiSelect, '-bmi');
tt.head(sortedBMI);
Dapper – A Bundle to Make Your ECL Neater
19. Lovely, I feel that’s
one of life’s great
questions
answered
I do of course
have other
questions on Star
Wars
20. Has anyone else noticed the lack of diversity in the SW
universe?
//How many of each species are there?
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
Dapper – A Bundle to Make Your ECL Neater
21. There are some pretty exciting eye colours though!
//Finally let's look at unique hair/eye colour combinations:
colourData := tt.select(StarWars, 'eye_color');
unqiueColours := tt.distinct(colourData, 'eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(unqiueColours);
Dapper – A Bundle to Make Your ECL Neater
22. Let's work through an example
Save
//and save our results
tt.to_csv(sortedBMI, 'ROB::TEMP::STARWARSCSV');
tt.to_thor(sortedBMI, 'ROB::TEMP::STARWARS');
Dapper – A Bundle to Make Your ECL Neater
24. IMPORT dapper.ExampleData;
IMPORT dapper.TransformTools as tt;
//load data
StarWars := ExampleData.starwars;
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);
//Fill blank species with unknown
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.',
species));
tt.head(fillblankHome);
//Create a BMI for each character
bmi := tt.append(fillblankHome, REAL, BMI, mass/height^2);
tt.head(bmi);
//Find the highest
sortedBMI := tt.arrange(bmi, '-bmi');
tt.head(sortedBMI);
//Jabba should probably go on a diet.
Dapper IMPORT dapper.ExampleData;
//load data
StarWars := ExampleData.starwars;
// Look at the data
OUTPUT(COUNT(StarWars), NAMED('COUNTstarWars'));
OUTPUT(StarWars, NAMED('starWars'));
//Fill blank species with unknown
//Create a BMI for each character
fillblankHomeAndBMI :=
PROJECT(StarWars,
TRANSFORM({RECORDOF(LEFT); REAL BMI;},
SELF.BMI := LEFT.mass / LEFT.Height^2;
SELF.species := IF(LEFT.species = '', 'Unkn.', LEFT.species);
SELF := LEFT;));
OUTPUT(fillblankHomeAndBMI, NAMED('fillblankHomeAndBMI'));
//Find the highest
sortedBMI := SORT(fillblankHomeAndBMI, -bmi);
OUTPUT(sortedBMI, NAMED('sortedBMI'));
//Jabba should probably go on a diet.
Base ECL
Dapper – A Bundle to Make Your ECL Neater
25. //How many of each species are there?
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
//Finally let's look at eye colour :
colourData := tt.select(StarWars, 'eye_color');
unqiueColours := tt.distinct(colourData, 'eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(unqiueColours);
//and save our results
tt.to_csv(sortedBMI,
'ROB::TEMP::STARWARSCSV');
//How many of each species are there?
CountRec := RECORD
STRING Species := sortedBMI.species;
INTEGER n := COUNT(GROUP);
END;
species := TABLE(sortedBMI, CountRec, species);
sortedspecies := SORT(species, -n);
OUTPUT(sortedspecies, NAMED('sortedspecies'));
//Finally let's look at unique eye colour:
colourData := TABLE(sortedBMI, {eye_color});
unqiueColours := DEDUP(SORT(DISTRIBUTE(colourData,
HASH(eye_color)),
eye_color, LOCAL), eye_color, LOCAL);
OUTPUT(COUNT(unqiueColours), NAMED('COUNTunqiueColours'));
OUTPUT(unqiueColours, NAMED('unqiueColours'));
//and save our results
OUTPUT(sortedBMI, , 'ROB::TEMP::STARWARSCSV',
CSV(HEADING(SINGLE), SEPARATOR(','),
TERMINATOR('n'), QUOTE('"')));
Dapper
Base ECL
Dapper – A Bundle to Make Your ECL Neater
27. Interested? You can install from our GitHub:
ecl bundle install https://github.com/OdinProAgrica/dapper.git
There’s also a more in-depth walkthrough (and infographic)
here:
https://hpccsystems.com/blog/dapper-bundle
Similar projects? Yes, yes we have!
https://github.com/OdinProAgrica
Dapper – A Bundle to Make Your ECL Neater
28. Bonus deck! We would like to introduce you to hpycc
Dapper – A Bundle to Make Your ECL Neater
29. Hpycc is a Python package that builds on the ideas of Dapper
That is:
How can we make HPCC Systems more useable to the Data Scientist?
How can this translate to engineering and development?
Dapper – A Bundle to Make Your ECL Neater
30. Things I find overly taxing
• Spraying new data
• Running scripts that I can customise easily
• Getting the results of queries and files
• ECL dev when I’m offsite
Dapper – A Bundle to Make Your ECL Neater
31. What if you could run all this from a Python notebook?
Now you can!
Dapper – A Bundle to Make Your ECL Neater
32. For the purposes of this demo I’ve made a throwaway function
Dapper – A Bundle to Make Your ECL Neater
33. I’m dev-ing locally so I’ll need HPCC Systems running
…then create a connection to my server
Dapper – A Bundle to Make Your ECL Neater
34. Let’s grab the raw Star Wars dataset…
Dapper – A Bundle to Make Your ECL Neater
35. What if we have more than one output?
Dapper – A Bundle to Make Your ECL Neater
42. Interested? You can install from pypi:
pip install hpycc
There’s also a more info on our github:
Similar projects? Yes, yes we have!
https://github.com/OdinProAgrica
https://github.com/OdinProAgrica/hpycc
Dapper – A Bundle to Make Your ECL Neater
43. Watch this space for our most recent project: Wally!
Dapper – A Bundle to Make Your ECL Neater
44. A little flavour of what we have already…
Dapper – A Bundle to Make Your ECL Neater
45. Interested? You can install from our github:
pip install hpycc
There’s also a more info on our github:
Similar projects? Yes, yes we have!
https://github.com/OdinProAgrica
https://github.com/OdinProAgrica/wally
Dapper – A Bundle to Make Your ECL Neater
47. …we are also building a stringtools as part of the Dapper
bundle
IMPORT dapper.stringtools as st;
source := 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion';
target := 'nobody expects the spanish inquisition';
Dapper – A Bundle to Make Your ECL Neater
48. …we are also building a stringtools as part of the Dapper
bundle
source := 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion';
target := 'nobody expects the spanish inquisition';
Dapper – A Bundle to Make Your ECL Neater
49. …we are also building a stringtools as part of the Dapper
bundle
IMPORT STD;
source := 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion';
target := 'nobody expects the spanish inquisition';
one := TRIM(std.Str.ToLowerCase(source), LEFT, RIGHT);
two := REGEXREPLACE('1', one, 'body');
three := REGEXREPLACE('[^a-z ]', two, '');
four := REGEXREPLACE('mm', three, 'n');
five := REGEXREPLACE('req', four, 'inq');
six := REGEXREPLACE('s+', five, ' ');
six;
Dapper – A Bundle to Make Your ECL Neater
50. …we are also building a stringtools as part of the Dapper
bundle
IMPORT dapper.stringtools as st;
source := 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion';
target := 'nobody expects the spanish inquisition';
regexDS := DATASET([
{'1' , 'body'},
{'[^a-z ]', '' },
{'mm' , 'n' },
{'req' , 'inq' },
{'s+' , ' ' }
], {STRING Regex; STRING Repl;});
st.regexLoop(source, regexDS);
target;
Dapper – A Bundle to Make Your ECL Neater