Douglas Moore discusses common anti-patterns seen when implementing big data solutions, based on lessons learned from working with more than 50 clients. He covers anti-patterns in hardware and infrastructure (such as relying on outdated reference architectures), in tooling (such as trying to do analytics directly in NoSQL databases), and in big data warehousing (such as over-curating data during ETL). The key is to understand the strengths and weaknesses of different tools and deploy the right solution for the intended workload.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skill sets of current Hadoop adopters.
Application architectures with Hadoop – Big Data TechCon 2014 (hadooparchbook)
Building applications using Apache Hadoop, with a use case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston, in April 2014.
Big Data Day LA 2016 / NoSQL track - Apache Kudu: Fast Analytics on Fast Data,... (Data Con LA)
Apache Kudu (incubating) is a new storage engine for the Hadoop ecosystem that enables extremely high-speed analytics without imposing data-visibility latencies. This talk provides an introduction to Kudu, and provides an overview of how, when, and why practitioners use Kudu as a platform for building analytics solutions.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla (Spark Summit)
Spark has deservedly become the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies, so their combination is one of the most common Big Data use cases. But what about security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI tools adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we still ensure that our Big Data environments remain secure? In this lecture, Abel and Jorge explain the adaptations they made to Spark's core to guarantee the security of multiple concurrent users sharing a single Spark cluster, under any of its cluster managers, without degrading Spark's outstanding performance.
Why Apache Flink is the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine. It also provides several APIs: batch processing (DataSet API), real-time streaming (DataStream API), and relational queries (Table API), along with domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
What is Apache Flink, how it fits into the Big Data ecosystem and why it is the 4G (4th Generation) of Big Data Analytics frameworks?
How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment?
Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?
Big Data with Hadoop, Flume, Spark, and Cloudera on the Oracle Big Data Appliance: Oracle Loader for Hadoop and copying data from Exadata to the Big Data Appliance. Bilginc IT Academy.
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...) (DataStax)
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics) data in real time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord, Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t, by itself, solve the problem of grabbing changes from the source, pushing them into Kafka, and consuming the data from Kafka for processing. And if something unexpected happens, like connectivity being lost on either the source or the target side, you don’t want to have to fix it or start over because the data is out of sync.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
The Hadoop Ecosystem for Developers session at DevGeekWeek in Israel.
This was a day-long session about big data problems and the Hadoop solution. We also talked about Spark and NoSQL.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
Hadoop and the Data Warehouse: Point/Counterpoint (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... (Rittman Analytics)
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves, and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata, and analytics on this platform, and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.
Rapid Cluster Computing with Apache Spark 2016 (Zohar Elkayam)
This is the presentation I used for the Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware, and cloud bills can be just as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive capacity requirements: volume of queries, access patterns, indexing, working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc. customers, and come up with their ideal cluster layout and hardware.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast, and flexible ad hoc query capabilities.
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production (Codemotion)
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered during this presentation.
Hadoop for Bioinformatics: Building a Scalable Variant Store (Uri Laserson)
Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
Teradata Partners Conference, Oct 2014: Big Data Anti-Patterns
1. Big Data Anti-Patterns: Lessons from the Front Lines
Douglas Moore
Principal Data Architect
Think Big, a Teradata Company
2. About Douglas Moore
Think Big – 3 Years
- Roadmaps
- Delivery
• BDW, Search, Streaming
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at:
@douglas_ma
3. Think Big
4-Year-Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
4. Content Drawn From Vast Amounts of Experience
…
50+ clients, including a leading security software vendor and a leading discount retailer
5. Introduction
I started out with just 3 topics…
Then, while on the road to Strata, I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Big Data Anti-Pattern: “Commonly applied but bad solution”
[Image credit: Wikipedia, I-95]
6. Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
7. Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
[Image source: HP: The Transformation to HP Converged Infrastructure]
Automated provisioning is a good thing!
8. #1 Locality
Locality, Locality, Locality
- Bring Computation to Data
Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
[Diagram: VM cluster vs. Hadoop cluster]
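Rack awareness is wired into Hadoop through a topology script: the NameNode invokes it with host addresses and uses the returned rack paths to keep block replicas rack-local. A minimal sketch in Python, assuming a hypothetical subnet-to-rack mapping (point net.topology.script.file.name in core-site.xml at a script like this):

    #!/usr/bin/env python
    # Hypothetical rack-topology script for Hadoop rack awareness.
    # Hadoop passes one or more IPs/hostnames as arguments and expects
    # one rack path per argument on stdout ("/default-rack" otherwise).
    import sys

    # Assumption: example subnet-to-rack mapping; substitute your own.
    RACKS = {
        "10.0.1.": "/dc1/rack1",
        "10.0.2.": "/dc1/rack2",
    }

    def rack_for(host):
        for prefix, rack in RACKS.items():
            if host.startswith(prefix):
                return rack
        return "/default-rack"

    if __name__ == "__main__":
        print(" ".join(rack_for(h) for h in sys.argv[1:]))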
9. #2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Large block IO
Append only writes
JBOD
Image credit: Wikipedia.org
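The gap is easy to demonstrate: reading the same file in order is far faster than hopping around it, which is why Hadoop favors large-block, append-only IO. A rough local sketch (file size, block size, and path are arbitrary assumptions; run each pass on a cold page cache for a fair comparison):

    # Rough sequential-vs-random read comparison; results vary by hardware
    # and are distorted by the OS page cache on warm runs.
    import os
    import random
    import time

    PATH = "io_demo.bin"
    SIZE = 256 * 1024 * 1024          # 256 MB test file (assumption)
    BLOCK = 64 * 1024                 # 64 KB per read

    with open(PATH, "wb") as f:       # create the test file once
        f.write(os.urandom(SIZE))

    def timed_read(label, offsets):
        start = time.time()
        with open(PATH, "rb") as f:
            for off in offsets:
                f.seek(off)
                f.read(BLOCK)
        print("%s: %.2fs" % (label, time.time() - start))

    offsets = list(range(0, SIZE, BLOCK))
    shuffled = offsets[:]
    random.shuffle(shuffled)          # same reads, random order

    timed_read("sequential", offsets)
    timed_read("random", shuffled)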
10. #3 Increase Parallelism
Increase # parallel components
- Reduce component cost
Data block replication
- Performance
- Availability
Commodity++ (2014)
- High-density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one
• 4-5x spindles
• 4-5x cores
[Diagram: Hadoop cluster]
11. #4 Failure
Expect Failure [1,2]
Rack Awareness
Data Block Replication
Task Retry
Node Blacklisting
Monitor Everything
Name Node HA
[Diagram: Hadoop cluster]
13. Tooling: Just Looking Inside the Box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
14. Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into MongoDB
• Running aggregates
• Running MR jobs
15. Best Practice
Best Practice:
• Split the stream (see the sketch below)
• Real-time access in NoSQL
• Batch analytics in Hadoop
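A hedged sketch of “split the stream”: one Kafka topic, two independent consumer groups, so NoSQL serving and Hadoop batch never compete for the same storage. The serving-side consumer below uses kafka-python and pymongo; the topic, broker, and collection names are hypothetical, and the HDFS-landing consumer would be a second group reading the same topic:

    # Serving-side consumer: keeps MongoDB current for real-time lookups.
    # A second consumer group (not shown) lands raw events on HDFS for
    # batch analytics in Hadoop.
    import json

    from kafka import KafkaConsumer     # pip install kafka-python
    from pymongo import MongoClient     # pip install pymongo

    consumer = KafkaConsumer(
        "events",                       # hypothetical topic name
        bootstrap_servers="broker:9092",
        group_id="nosql-serving",       # independent of the HDFS lander's group
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    collection = MongoClient("mongodb://mongo:27017")["logs"]["events"]

    for message in consumer:
        event = message.value
        # Upsert the latest state per key so serving reads stay fast.
        collection.replace_one({"_id": event["id"]}, event, upsert=True)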
16. Right Framework, Right Need…
Hadoop Streaming
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Hadoop integrates any type of application tooling:
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
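Hadoop Streaming runs any executable that reads stdin and writes tab-separated key/value pairs to stdout, which is how legacy code and analytic libraries get onto the cluster. A minimal mapper sketch in Python; the log layout and paths are hypothetical, and the built-in aggregate reducer sums the emitted counts:

    #!/usr/bin/env python
    # mapper.py: emit one (hour, 1) pair per log line.
    # Illustrative launch command:
    #   hadoop jar hadoop-streaming.jar \
    #     -input /logs/raw -output /logs/hits_by_hour \
    #     -mapper mapper.py -reducer aggregate -file mapper.py
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")   # assumption: tab-delimited logs
        if not fields or not fields[0]:
            continue                             # skip malformed lines
        hour = fields[0][:13]                    # e.g. "2014-10-20T14" from an ISO timestamp
        # The "LongValueSum:" prefix tells Hadoop Streaming's aggregate
        # reducer to sum the values per key.
        print("LongValueSum:%s\t%d" % (hour, 1))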
17. Right Use Case – ETL, Wrong Framework
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”
Bench-tested a Ruby ETL framework at 5,000 records / second
18. Right Use Case – ETL, Wrong Framework…
DO THE MATH:
Storm Java: ~1MM+ events / second / server
Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server
= 16.67 times more servers
bit.ly/1t0HXJH
Best Practice:
• Write new code in the fastest execution framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
• Innovation is important: Test and Learn
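The slide’s arithmetic, spelled out as a worked check (these are the deck’s benchmark figures, not a new measurement):

    # Throughput math from the slide above.
    java_per_server = 1000000            # Storm + Java: ~1MM+ events/sec/server
    ruby_per_core = 5000                 # bench-tested Ruby ETL rate
    cores_per_server = 12

    ruby_per_server = ruby_per_core * cores_per_server
    print(ruby_per_server)                            # 60000 events/sec/server
    print(java_per_server / float(ruby_per_server))   # ~16.67x more servers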
19. Big Data Warehousing
Hadoop Use Cases
1. ETL Offload
2. Data Warehousing
Hadoop Data Types
1. Structured
2. Semi-structured
3. Multi or Unstructured
20. First Principles: #5 Schema on Read
Don’t over-curate:
“We are going to
- define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”
HCatalog
Navigator, Loom, …
UDFs, UDTFs
- JSON, regex built in
- Custom Java
- Hadoop Streaming (e.g. use Python, Perl)
Hive Partitions
Recursive directory reads
Bucket Joins
Columnar formats
- ORC
- Parquet
Best Practices:
• Define what you need to
• Parse on demand (see the sketch below)
• Structure to optimize
• Beware the data palace fountain & data swamp
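Parse on demand means landing raw records untouched and extracting only the attributes a given job needs at read time; in Hive that is the job of the built-in JSON and regex UDFs. A minimal Python sketch of the same idea (the field names are hypothetical):

    # Parse on demand: leave raw JSON log lines in HDFS as-is and pull
    # out only the attributes today's analysis needs.
    import json
    import sys

    WANTED = ("timestamp", "host", "status")     # assumption: the 3 fields we need

    for line in sys.stdin:                       # e.g. piped from `hdfs dfs -cat /logs/raw/*`
        try:
            record = json.loads(line)
        except ValueError:
            continue                             # tolerate malformed lines
        # Missing attributes become empty strings; no up-front curation
        # of 1,000 fields required.
        print("\t".join(str(record.get(k, "")) for k in WANTED))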
21. Right Schema
OLTP: 3NF transactional source-system schema (order, customer, order line, product, contract, sales_person)
Data Warehouse: dimensional schema (customer, contract, order, order line, product, sales_person)
Hadoop: de-normalized schema (customer, contract, order, order line, product, sales_person flattened into one record)
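The Hadoop layout flattens the 3NF tables into one wide record per order line, trading storage for join-free full scans. A tiny sketch of that flattening, with hypothetical rows:

    # Flatten normalized OLTP rows into the de-normalized records Hadoop scans best.
    customers = {1: {"customer_name": "Acme"}}
    products = {7: {"product_name": "Widget"}}
    orders = [{"order_id": 100, "customer_id": 1}]
    order_lines = [{"order_id": 100, "product_id": 7, "qty": 3}]

    orders_by_id = {o["order_id"]: o for o in orders}

    flat = []
    for line in order_lines:
        order = orders_by_id[line["order_id"]]
        wide = dict(line)                             # start from the order line
        wide.update(customers[order["customer_id"]])  # fold in customer attributes
        wide.update(products[line["product_id"]])     # fold in product attributes
        flat.append(wide)

    print(flat[0])  # one wide, join-free record per order line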
22. Right Workload, Right Tool
[Table: workloads mapped across Hadoop, NoSQL, and MPP/reporting DBs/mainframe]
- ETL
- Business Intelligence
- Cross-business reporting
- Sub-set analytics
- Full-scan analytics
- Decision Support (Hadoop: TBs-PBs; MPP: GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
23. Summary
Understand strengths & weaknesses of each choice
- Get help as needed to make your first effort successful
Deploy the right tool for the right workload
Test and Learn
[Image: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/]
24. Thank You
Douglas Moore
@douglas_ma
Work with the best on a wide range of cool projects:
recruiting@thinkbiganalytics.com
25. Work with the Leading Innovator in Big Data
DATA SCIENTISTS · DATA ARCHITECTS · DATA SOLUTIONS
Think Big. Start Smart. Scale Fast.
Editor's Notes
When I’m not rock climbing, I’m doing big data architecture for Think Big clients.
Helping them realize value with data analytics.
3 Years at Think Big
Big Data Warehouse
Search
Streaming
Big Data Roadmaps
Tech assessments
Worked on 5 distributions, including the original Apache
The strengths we bring into this presentation….
This is not even half of it
I wrote the proposal for this spot with just 3 topics in mind,
then
I began discussing this with my colleagues and the topic generated quite a bit of buzz.
It’s amazing how much energy people will put into explaining crazy things they’ve seen.
With all of those architects, clients, and projects, how many anti-patterns did I come to Strata with?
343 if you’re doing the math in your head.
Many of our customers are big successful companies that have been around a long time
During the 90’s and the Oughts, they developed reference architectures
Based on input from companies like EMC, HP, IBM
They developed the mindset: “SERVERS MUST NOT FAIL”
And… this is what you needed for your Oracle OLTP servers to supplant Mainframe DB2 & Tandem.
These servers can range up to $35k/node. At one client, they were too embarrassed to show me how much they spent; they referenced “Proprietary Information”.
I could tell they spent a lot, based on the data node specs:
Dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network…
The best part of this reference architecture is Automated provisioning & configuration management.
Unfortunately I don’t see that as often as I would like.
Let’s go back to first principles of Hadoop & Big Data….
[turn]
Also seeing Hadoop companies migrate back this way to capture dying or dead data.
E.g. Cloudera – Isilon partnership
- Not the best performance but does turn that archive data from “dead data” into data producing business value
Let’s talk about Big Data and Hadoop first principles:
Everything is about locality
Best to bring your computation to where your data is.
What Hadoop does is shown here in the diagram
It doesn’t matter whether it’s hard disk, SSD, main memory, or cache: sequential IO is always faster
See that actuator Arm? On a modern drive it takes a good 4ms to move from one track to another.
That’s a lifetime in terms of computing.
SANs, virtual drives, and multi-tenant VM farms all essentially incur random-access reads (and writes).
Hadoop strives to move that arm as infrequently as possible.
The only thing slower than a disk seek is a round trip to the Netherlands.
So what Hadoop does:
Does IO in large blocks
Append Only Writes
Disks on each node are in a ‘Just a Bunch of Disks’ (JBOD) configuration, and Hadoop is the only workload accessing those drives. It can force sequential access and optimal throughput
Increase your parallelism
- Which brings us to principle #0: do not lock, sync, wait, dawdle, dally, hover, or loiter
Increase # components
To stay within budget, though, you must spend less per component to get more value per dollar, euro, or yen.
Hadoop helps out in this area:
Data block replication
Buy more servers for the cost of one
You get more spindles and cores
More spindles mean more IO
More cores mean more compute
Ultimately to get more throughput
With more components you need to expect failures and handle them in software
“That reminds me of the operations team that said: it’s fault tolerant, so we never have to fix it. Imagine a 300-node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low-priority tickets never get touched. This was that kind of environment”
Best practice: Fix it in the morning
There are more first principles, namely no locking….
Now I want to talk a bit about tooling: tools within and around Hadoop
A common anti-pattern is: “If it came in the box…
Example: Oozie
Your enterprise scheduler is well coordinated with the rest of your environment.
Others include, “We should use Pig”
Me: But all your people are SQL programmers
Customer: We should use Pig
We’re already demoing Hive.
Another reason for Hive: Hive leads in terms of optimizations
I like Pig for deeply nested data structures.
A NoSQL anti-pattern develops over the course of time:
First, streaming data is loaded into NoSQL to provide near-real-time content serving
This falls down because of some of the previous first principles
Namely locality
You end up with your storage in your Mongo cluster and your compute in the Hadoop cluster, moving the data across the data center to compute an aggregate.
Best practice… split the streams
There’s a place for each of these technology, we see them as complementary.
It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
So, choose the right tool for the right job
Let’s talk about Hadoop Streaming
- Not to be confused with stream processing (Samza, Spark Streaming, Storm)
The key purpose of Hadoop Streaming is to:
Integrate legacy code
Integrate analytic tools, like Python, R…
Got to love
Got to hate
I often hear this argument
New code, especially high-volume ETL code
High-value legacy code: Hadoop Streaming
Top Hadoop Use Cases
The #2 use case for Hadoop: quickly combine data from siloed systems
The #1 data type is structured
#2 is semi-structured: machine logs, web server logs
#3 is multi-structured: text, image, voice, …
When I structure, I structure right.
Thanks for your time today. We look forward to helping you drive new value from big data.
Questions?
Next steps?