At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion-related topics. This raises questions such as: How many fashion-related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner.
The latest generation of big data tools such as Apache Spark routinely handles petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms put the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively.
Hunter Kelly @retnuh
tech.zalando.com
Talk given at ClojureD conference, Berlin
7. ❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do the heavy lifting
■ Use Classifiers to filter/bucket the data
■ Build Topic Models to try to discover concepts related to words
8. ❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLlib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
10. ❖ DMOZ
➢ “The largest human edited directory of the web”
➢ Useful when you think of it in terms of “free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double-edged sword
11. ❖ Common Crawl (CC)
➢ “an open repository of web crawl data that can be accessed and analyzed by anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text formats
12. ❖ How to use them together!
➢ Use DMOZ to gather samples of positive and negative “seed links”
➢ Look up and expand your “seed links” using the CC index (see the sketch below)
➢ Fetch your data with little/no fuss using the CC index information
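To make the index lookup concrete, here is a minimal Clojure sketch of querying the Common Crawl index server. The crawl collection id, the JSON fields, and the example URL are illustrative assumptions rather than details from the talk, and it assumes org.clojure/data.json is on the classpath.

(require '[clojure.data.json :as json]
         '[clojure.string :as str])

;; Hypothetical: the CC-MAIN collection id and record fields are
;; assumptions based on the public index API, not taken from the deck.
(defn cc-index-lookup [url]
  (->> (slurp (str "http://index.commoncrawl.org/CC-MAIN-2015-48-index"
                   "?output=json&url="
                   (java.net.URLEncoder/encode url "UTF-8")))
       str/split-lines
       (map #(json/read-str % :key-fn keyword))))

;; Each record's filename/offset/length lets you fetch just that page's
;; WARC entry instead of scanning the whole crawl.
(comment
  (map (juxt :filename :offset :length)
       (cc-index-lookup "zalando.de")))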
14. ❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings to Spark
➢ Great Presentation (highly recommended)
➢ RDDs
➢ DataFrames
16. ❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or sharded) seqs
➢ Transformations (map, filter, etc.) are lazy
➢ Operations (count, collect, reduce, etc.) cause evaluation
➢ Very familiar paradigms for Clojure programmers (see the sketch below)
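A minimal Sparkling sketch of that lazy/eager split, assuming a local master and sparkling's default setup:

(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(def sc (spark/spark-context
          (-> (conf/spark-conf)
              (conf/master "local[*]")
              (conf/app-name "rdd-demo"))))

;; Transformations are lazy - nothing has run yet.
(def evens (->> (spark/parallelize sc (range 1000))
                (spark/map inc)
                (spark/filter even?)))

;; Operations force evaluation of the whole lineage.
(spark/count evens)   ;=> 500
(spark/first evens)   ;=> 2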
20. ❖ A Historical Tangent
➢ “Those who cannot remember the past are condemned to repeat it.”
➢ ~15 years ago, everything was running MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL, Google F1, etc.
22. ❖ DataFrames
➢ DataFrames are the new hotness
➢ It’s how Python and R can now achieve similar speeds
➢ The Catalyst execution engine can plan intelligently - behind the scenes it generates source code, makes heavy use of Scala macros, optimizes away boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and the upcoming DataSets
23. ❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM interop
➢ Heavy use of Scala magic like implicits, etc.
➢ Working with DataFrames from Clojure can be… less than pleasant (see the sketch below)
➢ Scala folks really like their static, declared types
➢ Going to get worse with DataSets
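A hedged taste of that interop friction, assuming Spark 1.x, the context from the RDD sketch above, and a hypothetical JSON input with site and words columns:

(import '[org.apache.spark.sql SQLContext])

(def sql-ctx (SQLContext. sc))

;; Path and column names are hypothetical.
(def df (-> sql-ctx .read (.json "/data/pages.json")))

;; Varargs methods like .select need explicit into-array from Clojure -
;; part of what makes this “less than pleasant”.
(-> df
    (.select "site" (into-array String ["words"]))
    (.show))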
28. ❖ Machine Learning Key Points
➢ Uses statistical methods on large amounts of data to hopefully gain insights
➢ Uses vectors of numbers extracted (by you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g. “fashion-related website” vs. “everything else”
➢ Topic modeling - a way of finding patterns in a bunch of documents - a “corpus”
30. ❖ MLlib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
31. ❖ MLlib (cont)
➢ All the basics - Vectors, Sparse Vectors, LabeledPoints, etc.
➢ A good variety of algorithms, all designed for running in parallel
➢ Well documented
➢ Large community
34. ❖ Example - Metrics
➢ BinaryClassificationMetrics has some useful things, but not basic things
➢ Have to use MulticlassMetrics for some of the most wanted metrics, even on a binary classifier (see the sketch below)
➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it at INFO
➢ End up iterating your data 3 (!) times to get all desired metrics
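A sketch of that MulticlassMetrics dance via interop; model and test-data are assumed to exist from a training step like the Random Forest one later in the deck:

(import '[org.apache.spark.mllib.evaluation MulticlassMetrics])

;; Pair each prediction with its true label.
(def prediction-and-labels
  (spark/map-to-pair
    (fn [lp] (spark/tuple (.predict model (.features lp)) (.label lp)))
    test-data))

;; MulticlassMetrics wants the underlying Scala RDD, hence .rdd.
(def metrics (MulticlassMetrics. (.rdd prediction-and-labels)))

(.fMeasure metrics 1.0)    ; F-measure for the positive label
(.recall metrics 1.0)
(.confusionMatrix metrics)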
36. ❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to the original word (see the sketch below)
■ Uses a gigantic Array instead of a HashMap
➢ ChiSqSelector - used to select the top N features
■ but how do we determine N? Can’t ask
■ End up grubbing around in the source to find it uses Statistics/chiSqTest
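A small sketch of why HashingTF “loses” the word: terms become hash-bucket indices, with no way back from an index to the term. The bucket count is illustrative.

(import '[org.apache.spark.mllib.feature HashingTF])

(def tf (HashingTF. 1000))   ; 1000 hash buckets

;; Any Iterable of terms works; the result is a sparse count vector
;; keyed by hash index - the strings themselves are gone.
(.transform tf ["fashion" "dress" "fashion"])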
39. ❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or “classes”) - Binary Classifier
➢ Or into many buckets - Multi-class Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs (see the sketch below):
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
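In MLlib terms, one supervised sample is a LabeledPoint. A minimal sketch, with the vector size matching the 460k-word vocabulary mentioned later and the indices/values invented for illustration:

(import '[org.apache.spark.mllib.linalg Vectors]
        '[org.apache.spark.mllib.regression LabeledPoint])

;; label 1.0 = “fashion-related website”, 0.0 = “everything else”
(def sample
  (LabeledPoint. 1.0
                 (Vectors/sparse 460000
                                 (int-array [3 17 4211])
                                 (double-array [2.0 1.0 5.0]))))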
40. ❖ The Bag of Words
➢ We started with very basic word cleansing - lowercase, remove non-letters/digits, 3-char minimum length, drop things that are just numbers
➢ Managed to make it this far in the talk without having to use word count!
➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
41. ❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M) even on a sample
➢ Were working on a bare baseline, so no stopword removal or stemming, following the KISS principle (a hypothetical reconstruction of the cleansing helper is sketched below)
➢ We did say a word must occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
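The clean-word-seq helper used on the next slide isn't shown in the deck; here is a hypothetical reconstruction following the rules from slide 40 (lowercase, letters/digits only, minimum length 3, drop pure numbers):

(require '[clojure.string :as str])

;; Hypothetical implementation - the real helper may differ.
(defn clean-word-seq [raw-text]
  (->> (str/split (str/lower-case (or raw-text "")) #"[^a-z0-9]+")
       (remove #(< (count %) 3))
       (remove #(re-matches #"[0-9]+" %))))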
42. Bag of Words
;; Assumes requires: [sparkling.core :as spark],
;; [sparkling.destructuring :as s-de], [clojure.set :refer [union]].
;; `site` (extracts the site from a URL) and MIN-SITE-OCCURANCE-COUNT
;; are project helpers defined elsewhere.
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       ;; (site, set of words on the page)
       (spark/map-to-pair
         (fn [m] (spark/tuple (site (:url m))
                              (set (clean-word-seq (:raw_text m))))))
       ;; merge the word sets per site
       (spark/reduce-by-key union)
       ;; emit (word, 1) once per site the word occurs on
       (spark/flat-map-to-pair
         (s-de/key-value-fn
           (fn [site words] (map spark/tuple words (repeat 1)))))
       ;; count distinct sites per word
       (spark/reduce-by-key +)
       ;; keep words occurring on enough distinct sites
       (spark/filter
         (s-de/key-value-fn
           (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
43. ❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of the feature set and training set
➢ Not “Deep Learning”, but extremely easy to use and very effective
➢ “Any sufficiently advanced technology is indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure 0.86 (training sketched below)
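A hedged sketch of training the binary classifier with MLlib's Random Forest via interop; labeled-points is the JavaRDD of LabeledPoint built from the bag-of-words features, and every hyperparameter value here is illustrative rather than the one used in the talk:

(import '[org.apache.spark.mllib.tree RandomForest])

(def model
  (RandomForest/trainClassifier
    labeled-points
    2                      ; numClasses: fashion vs. everything else
    (java.util.HashMap.)   ; no categorical features
    100                    ; numTrees
    "auto"                 ; featureSubsetStrategy
    "gini"                 ; impurity
    8                      ; maxDepth
    32                     ; maxBins
    12345))                ; seed

;; Predict a single page's class from its feature vector.
(.predict model (.features sample))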
47. ❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from a text corpus
➢ Topics -> cluster centers; docs -> rows
➢ Features are vectors of word counts (Bag of Words)
➢ Unsupervised Learning technique (but you do supply the topic count) - see the sketch below
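A sketch of fitting the topic model via interop; corpus is assumed to be a JavaPairRDD of (document id, word-count vector), and the topic count and iteration cap are illustrative:

(import '[org.apache.spark.mllib.clustering LDA])

(def lda-model
  (-> (LDA.)
      (.setK 20)              ; number of topics - supplied by you
      (.setMaxIterations 50)
      (.run corpus)))

;; Top 10 weighted terms per topic, as (term-indices, weights) pairs.
(.describeTopics lda-model 10)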
48. ❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory errors on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was 365752339 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
50. ❖ LDA (moar cont)
➢ Finally able to get a trained model after reducing the BoW to a more manageable size: ~11k words, down from ~160k
➢ Trained on ~100k documents, roughly an even split between fashion/non-fashion
➢ These models are for demonstration purposes; moar fanciness planned
54. ❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to munge the data
➢ Used state-of-the-art ML tools to analyze the data
➢ Explored for insights
55. ❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information even at this stage
➢ There’s a ton of interesting directions this can go
■ Run the classifier over all of the CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!