Grehack2013-RuoAndo-Unraveling large scale geographical distribution of vulne... – Ruo Ando
Feasibility study of large-scale attacks on DNS. We found 10,334,293 DNS servers in 34 hours in our first measurement (2013/05/31 – 2013/06/02) and 30,285,322 DNS servers in 26 hours in our second measurement (2013/07/05).
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da... – Cloudera, Inc.
Processing of large data requires new approaches to data mining: low (close to linear) complexity and stream processing. While in traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, to infer a model for predicting future/held-out observations, in stream processing the problem is often posed as extracting as much information as possible from the current data to convert it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas to speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian Learning on top of Bayesian Counters.
Francesc Alted (UberResearch GmbH), “New Trends In Storing And Analyzing Large Data Silos With Python”.
Bio: Teacher, developer and consultant in a wide variety of business applications. Particularly interested in the field of very large databases, with special emphasis on squeezing the last drop of performance out of the computer as a whole, i.e. not only the CPU, but the memory and I/O subsystems.
In this deck from the DDN User Group at ISC 2018, Dr. Peter Clapham from the Sanger Institute presents: Enabling a Secure Multi-Tenant Environment for HPC.
Learn more: https://insidehpc.com/2018/06/ddn-acquires-lustre-business-unit-intel/
and
https://www.ddn.com/company/events/isc-user-group/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013 – Amazon Web Services
MACPAC is a federal legislative branch agency tasked with reviewing state and federal Medicaid and Children's Health Insurance Program (CHIP) access and payment policies and making recommendations to Congress. By March 15 and again by June 15 each year, the agency produces a comprehensive report for Congress that compiles results from Medicaid and CHIP data sources for the 50 states and territories. The CIO of MACPAC wanted a secure, cost-effective, high performance platform that met their needs to crunch this large amount of health data. In this session, learn how MACPAC and 8KMiles helped set up the agency’s Big Data/HPC analytics platform on AWS using SAS analytics software.
Experiences in Providing Secure Multi-Tenant Lustre Access to OpenStack – inside-BigData.com
In this deck from the DDN User Group at SC17, Dr. Peter Clapham from Wellcome Trust Sanger Institute presents: Experiences in providing secure multi-tenant Lustre access to OpenStack.
"If you need 10,000 cores to perform an extra layer of analysis in an hour, you have to scale a significant cluster to get answers quickly. You need a real solution that can address everything from very small to extremely large data sets.”
The Wellcome Trust Sanger Institute, a charitably funded genomic research center located in the United Kingdom, is a world leader in studying the impact of genetics and genomics on global health. Since its inception in 1993, Sanger Institute has developed new understanding of genomes and their role in biology while delivering some of the most important advances in genomic research.
Watch the video: https://insidehpc.com/2017/12/experiences-providing-secure-multi-tenant-lustre-access-openstack/
Learn more: http://www.ddn.com/customers/sanger-institute/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
OpenStack is a cloud computing platform. OpenStack provides Infrastructure as a Service (IaaS) and comprises resources such as compute, storage and network resources. Resource allocation in a cloud environment deals with assigning available resources in a cost-effective manner. Resource allocation in OpenStack is carried out by the nova-scheduler. Logically, the scheduler allocates compute, network and storage resources to instance requests made by users of OpenStack. For efficient and cost-effective use of the scheduler, a resource pool can be created and maintained which contains the best possible hosts available. This paper presents a time-saving way of allocating resources for large cloud deployments in the form of virtual machines.
Strata 2014 Talk: Tracking a Soccer Game with Big Data – Srinath Perera
Mobile devices, sensors and GPS are driving the demand to handle big data in both batch and real time. This presentation discusses how we used complex event processing (CEP) and MapReduce-based technologies to track and process data from a soccer match as part of the annual DEBS event processing challenge. In 2013, the challenge included a data set generated by a real soccer match in which sensors were placed in the soccer ball and players’ shoes. This session will review how we used CEP to implement the DEBS challenge and achieved throughput in excess of 100,000 events/sec. It also will examine how we extended the solution to conduct batch processing using business activity monitoring (BAM) with the same framework, enabling users to obtain both instant analytics and more detailed batch-processing-based results.
These are the slides for a talk I gave to the Fredericksburg Linux User Group about Bitcoin and cryptocurrency in general on 2014-02-22. Audio is forthcoming from one of the attendees as a podcast.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
To understand an application’s performance, first you have to know what to measure. That’s the easy part. How do you take those measurements? Store them? Analyze them? Get them to the people who need them? Well, that’s where things get complicated, especially in the high-traffic distributed systems of the modern web! Like careful scientists, we must observe our subjects without altering them, and we must report our findings quickly so that we have the data necessary to make smart choices about the health and growth of the system.
Let’s explore the lessons learned by engineers at one of the world’s top web companies in their quest to find meaning at 5 MB/s. We’ll discuss the tools and techniques that enable the collection, indexing, and analysis of billions or more datapoints each hour, and learn how these same approaches can empower your applications and your business, no matter the scale.
Microservices, Continuous Delivery, and Elasticsearch at Capital One – Noriaki Tatsumi
This presentation focuses on the implementation of Continuous Delivery and Microservices principles in Capital One’s cybersecurity data platform – which ingests ~6 TB of data every day, and where Elasticsearch is a core component.
Title: Gemtalk Systems Product Roadmap
Speaker: Norm Green
Thu, August 21, 9:00am – 9:45am
Video Part1: https://www.youtube.com/watch?v=PTLzyjrml7g
Video Part2: https://www.youtube.com/watch?v=w9ya2-xRopM
Description
Norm Green started his career in 1989 at IBM in Toronto, Canada as a quality assurance engineer. In 1993, he moved to the DACS (Data Acquisition and Control System) team where he helped design and build a site-wide data collection system in VisualWorks and GemStone/S.
In 1996, he joined GemStone Systems as a Senior Consultant and traveled the world helping GemStone/S customers be more successful.
Within GemStone Systems, Norm held several positions including Director of Professional Services and Director of Engineering.
The Heatmap – Why is Security Visualization so Hard? – Raffael Marty
This presentation explores why it is so hard to come up with a security monitoring (or shall we call it security intelligence) approach that helps find sophisticated attackers in all the data collected. It explores the question of how to visualize a billion events. To do so, the presentation dives deeply into heatmaps - matrices - as an example of a simple type of visualization. While these heatmaps are very simple, they are incredibly versatile and help us think about the problem of security visualization. They help illustrate how data mining and user experience design help get a handle of the security visualization challenges - enabling us to gain deep insight for a number of security use-cases.
SAST, CWE, SEI CERT and other smart words from the information security world – Andrey Karpov
Do it right from the start (doesn’t work)
Follow company rules
Use “best practices”
Code Review
Couple development
Test-driven development (TDD)
Agile development
Tools
Container Performance Analysis – Brendan Gregg, Netflix – Docker, Inc.
Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using resource controls including cgroups. The interaction between containers can also be examined, and noisy neighbors either identified or exonerated. Performance tooling may also need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and namespaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show how to successfully analyze performance in a Docker container environment, and navigate the differences encountered.
Provenance for Data Munging Environments – Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Using Riak for Events storage and analysis at Booking.com – Damien Krotkine
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium- and long-term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg... – Imperva Incapsula
Mondrian, MySQL, Mongo, Cassandra, Lucene. You name it, we tried it. As a startup looking for cost-efficient and scalable solutions to power our event processing and statistics backend, we gave almost every Big Data technology out there a go. What we learned from these experiences is that doing it yourself is better than using plug-and-play black box solutions.
This presentation details the building of Incapsula’s Big Data system as a case study, examining the requirements and the different evolutionary phases it went through before becoming what it is today.
We issued 20 young coders with smartphones pre-loaded with an app that gathered data on the network activity of the other apps they used. Their data was captured using the Python-based data portal CKAN, analysed with scikit-learn, then returned to them using Docker and the IPython Notebook. Python also played a role in the reverse-engineering of some of the more interesting apps we discovered.
Similar to Optimal Strategies for Large-Scale Batch ETL Jobs (20)
3. Neustar
• Help the world’s most valuable brands understand and target their consumers both online and offline
• Maximize ROI on Ad spend
• Billions of user events per day, petabytes of data
3#EUdev3
5. Batch ETL
• Runs on schedule / programmatically triggered
• Aim for complete utilization of cluster resources, esp. memory and CPU
6. Why Batch?
• We care about historical state
• We don’t have an SLA other than 1–3x daily delivery
• Efficient, tuned, optimal use of resources; cost efficiency
7. What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
8. The attribution problem
• At Neustar, we process large quantities of ad events
• Familiar events like: impressions, clicks, conversions
• Which impression/click contributed to a conversion?
9. Example attribution
• Alice goes to her favorite news site, and sees 3 ads – impressions
• She clicks on one of them that leads to Macy’s – click
• She buys something on Macy’s – conversion
• Her purchase can be attributed to the click and impression events
10. The approach
• Join conversions with impressions and clicks on userId
• Go through each user and attribute conversions to the correct target event (impressions/clicks)
• Latest target events are valued more, so timestamp matters
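The join-then-attribute flow above can be sketched in plain Python (an illustrative stand-in only – the actual pipeline is a Spark job, and the tuple layout here is an assumption):

```python
from collections import defaultdict

def attribute(conversions, target_events):
    """Last-touch attribution sketch: credit each conversion to the most
    recent impression/click by the same user at or before the conversion."""
    by_user = defaultdict(list)
    for user_id, ts, kind in target_events:
        by_user[user_id].append((ts, kind))
    for events in by_user.values():
        events.sort()  # latest target events are valued more, so order by timestamp
    attributed = []
    for user_id, conv_ts in conversions:
        prior = [e for e in by_user[user_id] if e[0] <= conv_ts]
        if prior:
            attributed.append((user_id, conv_ts, prior[-1][1]))
    return attributed
```

In the Alice example, the click at the latest timestamp before the purchase wins the attribution.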
12. What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
13. Driver OOM
Exception in thread "map-output-dispatcher-12" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:615)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617)
14. Driver OOM
• Array of mapStatuses of size m; each status contains info about how the block is used by each reducer (n)
• m (mappers) x n (reducers)
[Diagram: Status1 → reducer1, reducer2; Status2 → reducer1, reducer2]
15. Driver OOM
• 2 types of mapStatus: highly compressed vs compressed
• HighlyCompressedMapStatus tracks the average reduce-partition size, with a bitmap tracking which blocks are empty for each reducer
16. Driver OOM
• Reduce the number of partitions on either side
• 300k x 75k → 100k x 75k
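The reduction matters because the driver's shuffle bookkeeping grows with mappers × reducers. A back-of-the-envelope check (the one-entry-per-pair accounting is a simplification; the real per-entry cost depends on the mapStatus encoding):

```python
def status_entries(mappers, reducers):
    """Number of (mapper, reducer) bookkeeping pairs the driver must track."""
    return mappers * reducers

before = status_entries(300_000, 75_000)  # 22,500,000,000 pairs
after = status_entries(100_000, 75_000)   # 7,500,000,000 pairs
print(before // after)                    # cutting mappers 3x cuts tracking 3x
```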
17. Disable unnecessary GC
• spark.cleaner.periodicGC.interval
• GC cycles “stop the world”
• Large heaps mean longer GC
• Set to a long period (e.g. twice the length of your job)
18. Disable unnecessary GC
• ContextCleaner uses weak references to keep track of every RDD, ShuffleDependency, and Broadcast, and registers when the objects go out of scope
• periodicGCService is a single-thread executor service that calls the JVM garbage collector periodically
19. Allow extra time
• spark.rpc.askTimeout
• spark.network.timeout
• In case of GC, our heap size is so large that we will exceed the timeout
20. Spurious failures
• Reading from S3 can be flaky, especially when reading millions of files
• Set spark.task.maxFailures higher than the default of 3
• We keep it below 10 to ensure true errors propagate out quickly
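Taken together, the settings from the last few slides might be applied like this (a sketch: the interval and timeout values are placeholders to tune per job, not the values used in the talk):

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # Slide 17: push periodic GC out past the job's expected runtime
    .set("spark.cleaner.periodicGC.interval", "12h")
    # Slide 19: allow extra time so large-heap GC pauses don't trip RPC timeouts
    .set("spark.rpc.askTimeout", "600s")
    .set("spark.network.timeout", "600s")
    # Slide 20: tolerate flaky S3 reads, but keep true errors surfacing quickly
    .set("spark.task.maxFailures", "8")
)
```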
21. What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
22. The skew
• Extreme skew in data
• A few users have 100k+ events over 90 days; the average user has < 50
• Executors dying due to a handful of extremely large partitions
23. The skew
• Out of 20.5B users, 20.2B have < 50 events
[Chart: count of users with number of events bucketed by 1000s; y-axis “# of users” from 0 to 2.5×10^10, x-axis “# of events” from 0 to ~504,000]
29. Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
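The salting idea can be sketched with a hypothetical helper (not our production code): append a random bucket to each hot key, aggregate per salted key, then strip the salt and aggregate again.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical salting sketch: spread one hot key across N buckets so no
// single partition receives all of its rows. Aggregate once per salted key,
// then strip the salt and aggregate a second time (hence the double shuffle).
public class Salt {
    static String salted(String key, int buckets) {
        return key + "#" + ThreadLocalRandom.current().nextInt(buckets);
    }

    static String unsalted(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }
}
```

As the slide notes, the second shuffle is the cost: in our experience it was rarely worth it.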
30. Bloom Filter
• Space-efficient probabilistic data structure to test
whether an element is a member of a set
• Size mainly determined by number of items in
the filter, and the probability of false positives
• No false negatives!
• Broadcast the filter out to executors
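To make the "no false negatives" property concrete, here is a toy Bloom filter in plain Java. This is illustrative only, not Spark's implementation; in practice you would use a library version.

```java
import java.util.BitSet;

// Toy Bloom filter: a bitset plus k hash functions. Membership tests can
// false-positive (all k bits happen to be set) but never false-negative.
public class ToyBloom {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    ToyBloom(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // k cheap, seeded hash functions derived from String.hashCode (toy only)
    private int hash(String x, int seed) {
        return Math.floorMod(x.hashCode() * 31 + seed * 0x9E3779B9, m);
    }

    void add(String x) {
        for (int i = 0; i < k; i++) bits.set(hash(x, i));
    }

    boolean mightContain(String x) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(x, i))) return false; // definitely absent
        }
        return true; // present, or a false positive
    }
}
```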
31. Bloom Filter
• Even with a high false positive rate, it is still a
very good filter
• P = 5% -> 80% filtered out
• Subsequent join much faster
32. Bloom Filter
• Tradeoff between accuracy & size
• We’ve had great success with Bloom Filters with
size of < 5G
• Experiment with Bloom Filters
33. Bloom Filter Applied
• For 50 billion conversions and a false positive
rate of 0.1%, the filter size is 80GB
• At a false positive rate of 5%, the filter size is 35GB
• Still too big
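These sizes follow from the standard Bloom filter sizing formula, bits = -n · ln(p) / (ln 2)². Plugging in the slide's numbers gives roughly the quoted figures (~84 GiB and ~36 GiB; the exact values depend on units and the calculator used).

```java
// Standard Bloom filter sizing: bits = -n * ln(p) / (ln 2)^2,
// where n = number of items and p = false positive probability.
public class BloomSizing {
    static double gib(double n, double p) {
        double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        return bits / 8.0 / (1L << 30); // bits -> bytes -> GiB
    }

    public static void main(String[] args) {
        System.out.printf("n=50e9, p=0.1%% -> %.1f GiB%n", gib(50e9, 0.001)); // ~83.7
        System.out.printf("n=50e9, p=5%%   -> %.1f GiB%n", gib(50e9, 0.05));  // ~36.3
    }
}
```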
34. Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
36. Long tail: what is it doing?
• Look at executor threads during long tail
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209)
org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
37. Long tail: what is it doing?
• Mappers were taking a long time writing to
shuffle space
• We need to reduce data size before going into
the shuffle
38. Long tail: what is it doing?
• Events in the long tail had almost identical
information, spread over time.
• For each user, if we retain just 1 event per hour,
at 90 days, it is around 2k events.
• However, this means we need to group by user
first, which requires a shuffle, which defeats the
whole purpose of this exercise, right?
39. Strategy: Filter during map side combine
• Use combineByKey and maximize map side
combine
• Thin collection out during map side combine ->
less is written to shuffle space
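A minimal sketch of the three combineByKey functions, with events reduced to bare timestamps so the thinning logic stands alone (the class and field names are hypothetical, not our production code):

```java
import java.util.HashMap;
import java.util.Map;

// Map-side combine that keeps at most one event per hour per user, so far
// less data is written to shuffle space. The combiner state maps an hour
// bucket to the first event timestamp seen in that hour.
public class HourlyThinner {
    static final long HOUR_MS = 3_600_000L;

    static Map<Long, Long> createCombiner(long ts) {
        Map<Long, Long> acc = new HashMap<>();
        acc.put(ts / HOUR_MS, ts);
        return acc;
    }

    static Map<Long, Long> mergeValue(Map<Long, Long> acc, long ts) {
        acc.putIfAbsent(ts / HOUR_MS, ts); // hour already represented -> drop
        return acc;
    }

    static Map<Long, Long> mergeCombiners(Map<Long, Long> a, Map<Long, Long> b) {
        b.forEach(a::putIfAbsent);
        return a;
    }
}
```

In Spark this would be wired up as `pairRdd.combineByKey(HourlyThinner::createCombiner, HourlyThinner::mergeValue, HourlyThinner::mergeCombiners)`; with 90 days of data this caps each user at 24 × 90 = 2,160 entries.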
45. Avoid shuffles
• Denormalize data or union data to minimize
shuffle
• Rely on the fact we will reduce into a highly
compressed key space.
• For example: we want a count of events by
campaign, and also a count of events by site
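The duplicate-rows idea can be sketched with plain Java streams standing in for Spark (on an RDD, `flatMapToPair` + `reduceByKey` would play the same role); the event shape here is hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Instead of caching the RDD and running two reduceByKeys (one per
// dimension), duplicate each row under a prefixed key and count everything
// in a single pass over a unioned key space.
public class UnionedCounts {
    // each event is (campaign, site); emit one row per key space
    static Map<String, Long> countBoth(List<String[]> events) {
        return events.stream()
            .flatMap(e -> Stream.of("campaign:" + e[0], "site:" + e[1]))
            .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    }
}
```

Afterwards, filtering keys by their `campaign:` or `site:` prefix recovers the two desired results. The trade: one shuffle instead of two, paid for with duplicated rows in memory.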
47. Coalesce partitions when loading
• Loading many small files – coalesce down # of
partitions
• No shuffle
• Reduce task overhead, greatly improve speed
• Going from 300k partitions to 60k, cut time by
half
48. Coalesce partitions when loading
final JavaRDD<Event> eventRDD = loadDataFromS3(); // load data
final int loadingPartitions = eventRDD.getNumPartitions(); // inspect how many partitions
final int coalescePartitions = loadingPartitions / 5; // use algorithm to calculate new #
eventRDD
  .coalesce(coalescePartitions) // coalesce to smaller #
  .map(e -> transform(e)); // faster subsequent operations
49. Materialize data
• Large chunk of data persisted in memory
• Large RDD used to calculate small RDD
• Use an Action to materialize the smaller
calculated result so larger data can be
unpersisted
50. Materialize data
parent.cache() // persist large parent PairRDD to memory
child1 = parent.reduceByKey(a).cache() // calculate child1 from parent
child2 = parent.reduceByKey(b).cache() // calculate child2 from parent
child1.count() // perform an Action
child2.count() // perform an Action
parent.unpersist() // safe to mark parent as unpersisted
// rest of the code can use memory
51. What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
52. Ganglia
• Ganglia is an extremely useful tool to
understand performance bottlenecks, and to
tune for highest cluster utilization
54. Ganglia: CPU wave
• Executors are going into GC multiple times in
the same stage
• Running out of execution memory
• Persist to StorageLevel.DISK_ONLY()
Today I’m going to share with you some tips and tricks to help you get started with processing large data in batch processes.
First, let me tell you a little bit about Neustar.
Neustar provides a range of cutting edge marketing solutions, which you can check out online at the above link.
On our team, we focus on helping the world’s most valuable brands
A quick word here on our stack,
Our jobs run on AWS EMR, and read and output data to S3. we’ve built infrastructure to support our data pipeline, so that we have a fast, highly fault tolerant, cloud based system.
All of our Spark jobs shown in the middle blue box are batch ETL jobs. Which is our focus today.
Our focus today is on Batch ETL jobs. Which differentiates itself from other use cases of Spark such as ad hoc data science uses, or streaming in the following ways.
They run on a schedule or are programmatically triggered, so they need to be reliable and robust; humans will not be manually monitoring the job or be able to tweak it during runs.
Secondly, we aim for complete utilization of cluster resources, especially memory and CPU. In streaming, depending on the workload, you might not care as much about using all of your machine resources at every moment. But in batch, the goal is to squeeze every bit of juice out of our machines, so we can arrive at our result with minimal cost.
We’re getting late-arriving data all the time, hence we maintain historical state.
We’re going to try to touch upon everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
By skew we mean when data is distributed in a way where outliers really affect performance.
Let’s use a specific problem to get started. This example is relevant to our business, and we will learn about spark in the process.
At Neustar, we process large quantities of ad events. Some examples include impressions, clicks, conversions.
The problem we’re going to solve is, which impression or click events contributed to the conversion happening?
A user usually has many events, so multiple conversions and multiple target events. We need to correctly attribute each of the conversion events for each user.
Let’s use a concrete example:
How would we solve this in code?
Attribute correctly based on timestamp: the target event must have occurred before the conversion.
We care about the latest target events.
Also based on metadata to ensure they are for the same advertiser etc.
For 90 days of events: we are loading around 100 TB of data.
For the maximum join, we are joining 50 billion conversions with 250 billion impressions
Let’s talk about issues you will encounter only at scale
When working at this scale, you will see errors that other people won’t see. For example, driver OOM.
That stack trace does not actually mean the JVM is out of memory; your metrics will say you have plenty of heap left. It is Java's limit on array allocation: the maximum array size is about 2 billion elements, and we have more than 2 GB of map status output.
The ByteArrayOutputStream cannot grow any more; you can see that it overflowed its buffer.
What is contained in this output stream?
We are serializing statuses, which is an array of MapStatuses. The size of the array is the number of map tasks. Each status contains information about how the map block is used by each reducer.
We can already see, this is a m times n problem.
Spark has 2 types of MapStatus:
If the number of partitions exceeds 2000, the HighlyCompressedMapStatus is used.
If the data also happens to be distributed in a way that prevents the bitmap from being highly compressible — for example, when each map task's output goes to a random set of reducers, with some reducers getting nothing and some getting output — you can create a situation where your buffer will not be big enough for the compressed statuses!
The easiest solution is to reduce the number of partitions. Luckily the solution is also transparent.
For Batch jobs we don’t have a long running application. Having long GC cycles is a waste of resources on our clusters, so we tweak our jobs and cluster settings so we can avoid large amounts of GC.
We’re working with large data, which means longer GC when it happens.
The setting to tweak here is the spark.cleaner.periodicGC.interval.
The cleaning thread will block on cleanup tasks, for example when the driver performs a GC and cleans up all broadcasts.
For larger jobs, we increase the periodic GC interval on the driver so that it is effectively disabled. For this job we set it to 600 minutes.
Java’s newer G1 GC completely changes the traditional approach. The heap is partitioned into a set of equal-sized regions, each a contiguous range of virtual memory. Certain region sets are assigned the same roles (eden, survivor, old) as in the older collectors, but there is no fixed size for them.
When a minor GC occurs, G1 copies live objects from one or more regions of the heap to a single region.
A full GC occurs only when all regions hold live objects and no fully empty region can be found.
This greatly improves heap occupancy when full GC is triggered, and also makes minor GC pause times more controllable.
G1 GC: low latency, high throughput.
We need Spark to be a little more patient when dealing with larger jobs, so we ask it to wait longer for communication between machines.
In case of a GC, because our heap size is so large, we will exceed the timeout.
Protection against spurious failures, which can occur. Spark's default maxFailures per task is 3, which is not enough in large jobs.
On the other hand, we don’t want to set it too high, or else true errors will get retried forever, preventing a timely response, which is costly.
At this scale, you’re more likely than not to encounter skew. We had extreme skew in our data.
The overwhelming majority of our users have fewer than 1000 events. In fact, most have fewer than 50 events.
Upon closer inspection, our data had very few extreme outliers: out of 20B users, only 879 had more than 75k events.
these very few bad partitions were causing our executors to die
By grouping events into lists, executors weren’t as likely to die anymore.
We increased partitions and nested our data, and our executors weren’t dying anymore!
But there were still inefficiencies caused by skew. Notice the lag in CPU use in ganglia. All of those resources were wasted because there were a few bad partitions.
The max was taking 50 minutes vs the median of 24 seconds!
We need to do something about this.
If you have domain specific knowledge of your data, use it to filter “bad” data out as early as you can
We couldn’t do this in the naïve way because we had no idea which users were going to be bad
Salt your data, and join twice. In our empirical studies, this was never worth it since shuffle is so expensive.
Use bloom filter if one side of your join is much smaller than the other
Bloom filters quickly become large after a certain size. For example, with our 50 billion conversions, the bloom filter for a 0.1% false positive rate would have been around 80 GB. To broadcast this out and filter would have used more resources than to directly join.
The size is mainly determined by the number of items in the filter and the probability of false positives: the more items you have, the bigger the filter; the more accurate the filter, the bigger the filter.
The filter can be broadcast out to executors, and used in tasks.
Spark broadcast uses a BitTorrent-like algorithm, splitting the data into slices (a couple hundred pieces), which lets it overlay a couple of gigabytes across the cluster.
A Bloom filter is a bitset plus several hash functions; the more hash functions, the better the accuracy.
A membership test checks hash1(x) && hash2(x) && hash3(x) against the bitset.
One thing to note: don’t be afraid of using a high false positive rate. If our goal is to shrink the joinable set down, using a 5% false positive rate results in 80% of the data being filtered out. This makes subsequent joins much faster.
However, we’ve had great success in many other jobs where the join was even more lopsided. Please experiment and see if it could work for you.
Now going back to our original problem. You can find bloom filter size calculators online to find out if Bloom filter is a good solution for you.
Unfortunately, both side of our join were sizable, and the bloom filter solution was not the right solution for this job.
None of these strategies worked directly for us, so we were still left with skew.
Let’s look at the ugly long tail graph again.
Let’s go back to our problem. We have a very long tail.
It is always a good idea to understand what your threads are spending time on, and for long tail, we are especially interested.
On our r3.4xls with 16 cores, we saw that most threads were stuck on writing. The job was spending a lot of time writing to shuffle space, so let’s try to reduce the size to be written.
Let’s go back to the data. Inspecting the event data revealed to us that in the long tail, the events had almost identical information, spread over time. If we retain just 1 event per hour, at 90 days, it is at most 2160 events per user. However, this seems to mean that we need to group by userId first, which requires a shuffle, which defeats the whole purpose of this exercise, right? This is where map side combine comes in.
If we use CombineByKey, we can specify the operation to perform on the map side. Here we can thin out the collection.
The basic structure is shown here. Notice in the second lambda, we rate limit/ thin out collection when list size reaches a certain max. Feel free to check out the slides, they will be available online.
We got rid of the long tail, but my job is still generally slow, can I squeeze more performance out of it by tweaking my job?
The answer is yes.
Reusing the partitioner allows all the data to be partitioned the same way, and ready for joins/reduceByKeys. This way we can transform multiple wide transformations into a single narrow transformation
Please don’t try to read this, what I want to point out here, is this long narrow stage. It contains 3 combineByKeys, and 2 left outer joins in the same narrow stage. If we had used different partitioners, each would induce a shuffle.
But we can go further than that to reduce the amount of shuffles. We are willing to pay for it in memory.
Let’s say we want to get a count of users by campaign, and we also want a count of users by site.
The straightforward way to do this is to take the RDD, cache it, and reduceByKey on it twice.
But if we duplicate the rows like this, we are able to reduceByKey once and then filter for the two desired results.
Another strategy that greatly improves performance is coalescing # of partitions down.
A simplistic example would be this. We load data from S3, calculate the original partition number, use an algorithm that makes sense for your data to find the new partition #, then coalesce your data down to new # of partitions.
Subsequent operations will be more efficient
Do a count, or a reduce: Spark is lazy, so you have to materialize first, or else Spark will load from the source again.
Let’s go through two ganglia use cases. It’s available in AWS EMR.
What happens when you see waves occurring within a single stage? Sometimes the waves stretch out, taking longer than 10 minutes from crest to trough.
DISK_ONLY caching does depend on the machine class: on r4 it's not that great; on-board SSD is good, EBS is not that great. Tune and test.
Notice the last stage is only using 70% of CPU for the entire process.
Decrease the number of partitions of the RDD used in this stage. Less total overhead; fewer tasks means each task computes more data, and this helps with CPU utilization.
After tweaking the above 2 properties, we get much better cluster usage!
We’ve gone through most of these configurations today; the ones I didn’t mention, Spark does a good job of notifying users to set when needed. Feel free to reach out to me with any questions about this after the session.
spark.dynamicAllocation.enabled FALSE — not multi-tenant.
spark.driver.maxResultSize 8g — set really high, since our driver is large; we can handle lots of results going to the driver.
spark.rpc.message.maxSize 2047 — if you have many map and reduce tasks, use the max. This is used to send out mapStatuses, which we know is m x n.
The overhead exists because of JVM overhead; it protects against executor OOM. What goes into memory overhead? The intern pool and everything off the JVM heap: thread stacks, NIO buffers, shared native libraries.
The YARN coordination process takes 896 MB (you can find it on the YARN application page).
Why not dataframes?
Complicated logic is difficult to express in SQL: for example, finding the latest events that fit specific criteria after a join, or rate limiting data by hour boundaries; partitions, partitioners, combineByKey.
Customization of code: really large data requires specific partition settings, map-side combine optimizations, and bloom filters in each type of stage. This is more difficult to do with dataframes.
DataFrames are great, but RDDs give us the flexibility to exploit patterns that exist in the data. In R&O we were able to go from multiple reduces with data skew down to a single pass with RDDs. DataFrames are a great general data analysis tool; RDDs are fantastic for power users.
6x difference in size between DF vs RDDs.