During my time working on attribution and ingest systems, I've encountered several different approaches to a simple question: "How do I get data from A to B?" In this session, I'd like to share some of the problems I've encountered and how to solve them effectively.
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Data Con LA
2. Agenda
1. Overview of Big Data Ingest
2. Real world examples with lessons interleaved
3. A summary of lessons learned and extra ideas
3. Big Data Ingest
• Data sources: Ingesting from different data sources is the goal.
• Schema: Several data sources have different structures, but mostly it is the schemas that vary.
• Speed: Batch and real-time ingest both have their places.
5. Schema
• Number of schemas: One schema with a relatively flat structure, or many schemas with nested structures.
• Mutability: Immutable schemas can’t be changed. Mutable schemas can evolve. Nested schemas can also have mutability properties.
• Inference: Schema inference upon writing, reading, or offline.
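As a rough illustration of schema inference on read, the sketch below scans already-parsed records and collects the value types seen for each field. The function name `infer_schema` and the sample records are assumptions made for this example, not something from the talk.

```python
def infer_schema(records):
    """Collect, per field name, the Python type names observed across records."""
    seen = {}
    for record in records:
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {name: sorted(types) for name, types in seen.items()}


# Hypothetical sample: the "id" field arrives as both int and str.
sample = [{"id": 1, "name": "a"}, {"id": "2"}]
inferred = infer_schema(sample)
```

A field that shows more than one type, or that is missing from some records, is a hint that the source schema is mutable.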
6. Real Time vs Batch
• Batch: Transfer data from A -> B on demand.
• Push model: Push data from A -> B consistently. Poll on data sources or act upon reception.
• Pull model: Clients pull data from A to write to B. Often an intermediate storage system like Kafka is used to achieve this.
7. Real world scenario: Form generator
• GOAL: Generate different forms for websites
• Store user information
• Forms cannot change over time
8. Lesson #1: Structure endpoint wisely
One table per form:
• Form Definition: id, form name, form metadata
• Form 1: id, <field 1>, <field 2>, <field 3>
• Form 2: id, <field 1>, <field 2>, <field 3>
Generic tables shared by all forms:
• Form Definition: id, form name
• Field Definition: id, form id, field name, type
• Field Values: id, field id, value
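The second set of tables on this slide (Form Definition, Field Definition, Field Values) could be sketched with Python's built-in `sqlite3` module as below. The SQL column names, the column types, and the helper functions are assumptions made for illustration; the slide only names the tables and their fields.

```python
import sqlite3


def create_form_store(conn):
    """Create the three generic tables (column types are assumed)."""
    conn.executescript("""
        CREATE TABLE form_definition (
            id INTEGER PRIMARY KEY,
            form_name TEXT NOT NULL
        );
        CREATE TABLE field_definition (
            id INTEGER PRIMARY KEY,
            form_id INTEGER NOT NULL REFERENCES form_definition(id),
            field_name TEXT NOT NULL,
            type TEXT NOT NULL
        );
        CREATE TABLE field_values (
            id INTEGER PRIMARY KEY,
            field_id INTEGER NOT NULL REFERENCES field_definition(id),
            value TEXT
        );
    """)


def add_form(conn, form_name, fields):
    """Register a form and its (name, type) fields; return {field_name: field_id}."""
    form_id = conn.execute(
        "INSERT INTO form_definition (form_name) VALUES (?)", (form_name,)
    ).lastrowid
    field_ids = {}
    for name, type_ in fields:
        field_ids[name] = conn.execute(
            "INSERT INTO field_definition (form_id, field_name, type) "
            "VALUES (?, ?, ?)",
            (form_id, name, type_),
        ).lastrowid
    return field_ids


def store_value(conn, field_id, value):
    """Store one submitted value as text against its field definition."""
    conn.execute(
        "INSERT INTO field_values (field_id, value) VALUES (?, ?)",
        (field_id, str(value)),
    )


def values_for_form(conn, form_name):
    """Return (field_name, value) pairs for a form, in insertion order."""
    return conn.execute(
        """SELECT fd.field_name, fv.value
           FROM field_values fv
           JOIN field_definition fd ON fd.id = fv.field_id
           JOIN form_definition f ON f.id = fd.form_id
           WHERE f.form_name = ?
           ORDER BY fv.id""",
        (form_name,),
    ).fetchall()
```

With this layout a new form is just new rows in `form_definition` and `field_definition`, so the ingest side does not need a new table for every form.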
9. Real world scenario: Scrape GitHub
• GOAL: Generate a list of active contributors on a repository and general stats about a repository relative to all other repositories.
• Scheduled batch Change Data Capture (CDC).
11. Lesson #2: At least once is acceptable
• Ingesting data twice doesn’t matter in a lot of cases.
• The cost of re-processing or re-ingesting a few records is normally pretty low.
• It’s easy to manage and implement.
• Exactly once semantics, in contrast, is not feasible
– Usually requires some de-duping
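One common way to make at-least-once delivery harmless is to make the write idempotent, so that a record delivered twice is stored once. The sketch below assumes every record carries a unique `id` field; the class and function names are hypothetical.

```python
class IdempotentSink:
    """In-memory sink that ignores records whose id it has already stored."""

    def __init__(self):
        self.records = {}

    def write(self, record):
        """Store the record keyed by its id; return True only on first delivery."""
        record_id = record["id"]
        if record_id in self.records:
            return False
        self.records[record_id] = record
        return True


def ingest(batches, sink):
    """Write every record of every batch; return how many duplicates were skipped."""
    duplicates = 0
    for batch in batches:
        for record in batch:
            if not sink.write(record):
                duplicates += 1
    return duplicates


# A batch that is re-sent after a timeout, followed by one new record.
first = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
retry = first + [{"id": 3, "v": "c"}]
sink = IdempotentSink()
skipped = ingest([first, retry], sink)
```

The de-duplication cost here is one dictionary lookup per record; a real sink would need a durable key store instead of an in-memory dictionary.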
14. Lesson #3: CDC is hard
• Change Data Capture (CDC) without a change log or an easy way to calculate differences is hard.
• Almost always requires some customized effort.
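When the source offers no change log, one customized fallback is to diff two full snapshots. The sketch below fingerprints each row and compares by key. It assumes each row is a JSON-serializable dict with a unique `id`, and the function names are hypothetical. It also shows why this is costly: both snapshots must be read in full.

```python
import hashlib
import json


def fingerprint(row):
    """Stable hash of a row; sort_keys makes it independent of key order."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_snapshots(previous, current, key="id"):
    """Return the keys inserted, updated, and deleted between two snapshots."""
    prev = {row[key]: fingerprint(row) for row in previous}
    curr = {row[key]: fingerprint(row) for row in current}
    return {
        "inserted": sorted(k for k in curr if k not in prev),
        "updated": sorted(k for k in curr if k in prev and curr[k] != prev[k]),
        "deleted": sorted(k for k in prev if k not in curr),
    }


# Hypothetical contributor snapshots from two scheduled runs.
yesterday = [{"id": 1, "commits": 10}, {"id": 2, "commits": 4}]
today = [{"id": 1, "commits": 12}, {"id": 3, "commits": 1}]
changes = diff_snapshots(yesterday, today)
```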
15. Real world scenario: Ad attribution system
• GOAL: Gather impression and click information. Attribute to different vendors based on impressions and clicks.
• Expose a view for customers to understand their usage.
• Near real time (NRT) with batch error checking.
16. Lesson #4: Know thy SLA
• What is the incidence of errors?
• How frequently should errors be checked?
• Is data loss acceptable?
• Is duplication acceptable?
18. Push version analysis
• Negatives
– Scribe would lose data in some edge cases. That’s not good for attribution systems (money involved).
– The number of messages written to HBase would cause major compactions on a weekly basis, halting the pipeline.
• Positives
– Latency was super low
– Relatively easy to maintain given Scribe configuration
* Flume would have been a better choice! It has better reliability guarantees!
20. Pull version analysis
• Negatives
– Requires more management and configuration.
• Positives
– Can choose the data-loss trade-off: at most once or at least once semantics.
– Intermediate storage relieves HBase.
* Kafka would have been a cool choice! It has better data retention and scalability!
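In a pull model, the choice between at most once and at least once often comes down to whether the consumer commits its read position before or after writing downstream. The sketch below simulates that with an in-memory log. The `Log` class, the `consume` function, and the `crash_at` parameter are hypothetical stand-ins, not the API of Kafka or any other system.

```python
class Log:
    """Append-only in-memory log standing in for intermediate storage."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read


def consume(log, sink, commit_first, crash_at=None):
    """Pull messages starting at the committed offset and append them to sink.

    commit_first=True commits before writing (at most once);
    commit_first=False writes before committing (at least once).
    crash_at simulates a crash between the two steps at that offset.
    Returns False if the run crashed, True if it reached the end of the log.
    """
    offset = log.committed
    while offset < len(log.messages):
        message = log.messages[offset]
        if commit_first:
            log.committed = offset + 1
            if offset == crash_at:
                return False  # committed but never written: message is lost
            sink.append(message)
        else:
            sink.append(message)
            if offset == crash_at:
                return False  # written but never committed: will be re-read
            log.committed = offset + 1
        offset += 1
    return True


def run_with_crash(commit_first):
    """Crash once at offset 1, then restart and drain the rest of the log."""
    log, sink = Log(["m0", "m1", "m2"]), []
    consume(log, sink, commit_first, crash_at=1)
    consume(log, sink, commit_first)
    return sink
```

Running `run_with_crash(True)` drops the message in flight at the crash, while `run_with_crash(False)` delivers it twice, which is the trade-off the slide describes.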
21. Lesson #5: Record format and schema
1. Structureless (or simple structure) and schemaless
a. Log file (e.g. uuid|val1|val2|val3|...)
2. Structured without explicit schema
a. JSON (e.g. {“key1”: “val1”, ...})
3. Structured with explicit schema
a. Avro (e.g. {“key1”: “val1”, ...}, but with schema)
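The three categories can be contrasted in a few lines of Python. This is only a sketch: the schema dict below imitates the shape of an Avro record schema, but the check is a hand-written stand-in and does not use an Avro library. The field names and the `conforms` helper are assumptions.

```python
import json


def parse_log_line(line):
    """Pipe-delimited record: the meaning of each position is known only by convention."""
    uuid, *values = line.split("|")
    return {"uuid": uuid, "values": values}


def parse_json_record(text):
    """JSON record: field names travel with the data, but nothing enforces them."""
    return json.loads(text)


# Hypothetical schema written in the style of an Avro record schema.
SCHEMA = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "key1", "type": "string"},
        {"name": "count", "type": "int"},
    ],
}

PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Check that a record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields
    )
```

With an explicit schema, a record such as `{"key1": "val1", "count": "2"}` can be rejected at ingest time, whereas the first two formats would accept it silently.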
22. Issues with structure
• Verbosity is directly related to human readability
• Verbosity impacts performance of systems
• Verbose and readable formats: XML, YAML, JSON, etc.
• Not-so-verbose and not-so-readable formats: MessagePack, Protobuf, Avro binary, Parquet, etc.
• Sufficient tooling can make human readability less necessary.
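To make the verbosity trade-off concrete, the sketch below encodes the same record as JSON text and as a fixed binary layout using Python's standard `struct` module. The record fields and the layout are assumptions chosen for illustration, and `struct` stands in here for the binary formats listed above rather than reproducing any of them.

```python
import json
import struct

# Hypothetical record for the comparison.
record = {"user_id": 12345, "clicks": 7, "cost": 0.25}

# Little-endian, no padding: uint32 + uint16 + float64 = 14 bytes.
RECORD_LAYOUT = struct.Struct("<IHd")


def encode_json(rec):
    """Readable encoding: field names are repeated in every record."""
    return json.dumps(rec).encode("utf-8")


def encode_binary(rec):
    """Compact encoding: field order and types live in the layout, not the data."""
    return RECORD_LAYOUT.pack(rec["user_id"], rec["clicks"], rec["cost"])


def decode_binary(data):
    """Decoding needs the same layout; the bytes alone are not self-describing."""
    user_id, clicks, cost = RECORD_LAYOUT.unpack(data)
    return {"user_id": user_id, "clicks": clicks, "cost": cost}


json_size = len(encode_json(record))
binary_size = len(encode_binary(record))
```

The binary form is smaller but unreadable without the layout, which is where tooling has to replace human readability.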
23. Issues with schema
• Flexibility and structure are inversely related.
• A flexible schema
– Doesn’t require an upfront definition
– Allows you to make and validate assumptions about the data.
– Easy to extend, but difficult to track changes
– May have nested structures
■ e.g. uuid|val1|val2|{“field1”: “value1”, ...}|...
• A structured schema
– Easier for everyone (human and computer) to understand
– Saves time when serializing/deserializing
24. Lesson #6: Record lineage
• Where is the data coming from?
• How has it changed as it enters the system?
• Snapshots?
• Who touched the data?
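One lightweight way to answer these questions is to wrap each record in an envelope that carries its source and a history of the steps applied to it. The sketch below is an assumption about how that could look; the `TracedRecord` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TracedRecord:
    """A payload plus where it came from and what has been done to it."""

    payload: dict
    source: str
    history: list = field(default_factory=list)

    def apply(self, step_name, actor, transform):
        """Run a transformation and log which step and actor touched the record."""
        self.payload = transform(self.payload)
        self.history.append({
            "step": step_name,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self


# Hypothetical usage: cast a string field to int during ingest.
rec = TracedRecord({"clicks": "7"}, source="web-logs")
rec.apply("cast_clicks", "ingest-job", lambda p: {**p, "clicks": int(p["clicks"])})
```

After the call, `rec.history` holds one entry naming the step, the actor, and a UTC timestamp, while `rec.source` still points at the origin.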
25. Summary of lessons
1. Structure endpoints wisely
2. At least once semantics is easy and acceptable
3. CDC is hard
4. Know thy SLA
5. Record format and schema should be thought through
6. Record lineage (provenance)
26. Extra ideas
1. Keep track of erroneous records
a. Anomalies lead to more knowledge about data source
b. Improves debugging
2. Keep transformations to a minimum
a. Schema inference makes sense
b. Massive computations can slow down the ingest process and cause back pressure in the pipeline
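Keeping track of erroneous records can be as simple as routing anything that fails to parse or validate into a separate list together with the reason. The sketch below assumes newline-delimited JSON input and a required `id` key; both are assumptions for illustration, as is the function name.

```python
import json


def ingest_lines(lines, required_keys=("id",)):
    """Split input lines into accepted records and rejected lines with reasons."""
    accepted, rejected = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)  # JSONDecodeError is a ValueError
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"missing keys: {missing}")
        except ValueError as error:
            # Keep the raw line so the anomaly can be inspected later.
            rejected.append({"line": line_number, "raw": line, "error": str(error)})
            continue
        accepted.append(record)
    return accepted, rejected


good, bad = ingest_lines(['{"id": 1}', "not json", '{"name": "x"}'])
```

The ingest step itself only parses and checks required keys, which keeps transformations to a minimum; the rejected list is what feeds debugging and later analysis of the source.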