Have a big data product, database, or DBMS? Need to test it? Don't know where to start? Here are some things to consider while you design your QA automation.
Link to Video
https://www.youtube.com/watch?v=MlT4pP7BGFQ
2. Agenda
● Introduction to Big Data Testing
● Methodological approach for QA Automation road map
● True Story
● The Solution?
● Challenges of Big Data Testing
● Performance & Maintenance
6. Big Data?
● Structured (SQL, tabular, strings, numbers)
● Unstructured (logs, pictures, binary, JSON, blobs, video, etc.)
7. OLAP or OLTP?
● Simply put:
○ OLAP: one query running over a lot of data for reporting/analytics
○ OLTP: many queries running quickly in parallel from applications (e.g., a user logging in to a website whose credentials are stored in a DB)
8. Type of Big Data products
● Databases
○ OLAP, OLTP
○ SQL, NoSQL
○ In-memory, disk-based
○ Hadoop ecosystem, Spark
● Ecosystem tools
○ ETL tools
○ Visualization tools
● General applications
● Analytics-based products
○ Fraud
○ Cyber
○ Finance, etc.
9. Expectation Matching for today’s lecture...
● OLAP database
● Structured data only; synthetic data was used in testing
● (not about QA of analytics-based products/solutions)
● (not about Hadoop or event-streaming clusters)
● (not trying to sell anything)
● In a nutshell:
○ How do you test an SQL-based database?
○ What sort of challenges were met?
10. How Hard Can it be (aggregation)?
● Select Count(*) from t;
○ Assume
■ 1 Trillion records
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale up limitations?
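To make the slide's questions concrete, here is a back-of-envelope sketch in Python. Every number in it (row width, per-disk throughput, disk count) is an illustrative assumption, not a measured product figure:

```python
# Back-of-envelope: how long does a full scan of 1 trillion rows take?
# All numbers below are illustrative assumptions, not product figures.

rows = 1_000_000_000_000         # 1 trillion records
bytes_per_row = 50               # assumed average row width
scan_bytes = rows * bytes_per_row

disk_throughput = 500 * 1024**2  # ~500 MB/s sequential read per disk (assumed)
disks = 16                       # assumed disks scanning in parallel

seconds = scan_bytes / (disk_throughput * disks)
print(f"Scan size: {scan_bytes / 1024**4:.1f} TiB")
print(f"IO-bound scan time: {seconds / 3600:.1f} hours")
# ~45 TiB to read; at ~8 GB/s aggregate that is roughly 1.7 hours of pure
# IO, before any CPU work - which is why the IO pattern dominates this test.
```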
11. How Hard Can it be (join)?
● Select * from t1 join t2 where t1.id = t2.id
○ Assume
■ 1 Trillion records both tables.
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale out limitations?
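The join version of the same arithmetic, under the same caveat that all figures are assumptions: a distributed join must repartition both tables across the cluster, so the network is the quantity to estimate:

```python
# Back-of-envelope: network cost of joining two 1-trillion-row tables on a
# scale-out cluster (hash or sort-merge joins repartition both sides).
# Illustrative assumptions only.

rows = 1_000_000_000_000
bytes_per_row = 50
nodes = 10                                # assumed cluster size
network_gbit = 25                         # assumed per-node link, Gbit/s

shuffle_bytes = 2 * rows * bytes_per_row  # both tables move across the wire once
per_node = shuffle_bytes / nodes
seconds = per_node / (network_gbit / 8 * 1e9)
print(f"Total shuffle: {shuffle_bytes / 1e12:.0f} TB, "
      f"{seconds / 3600:.1f} hours per node at line rate")
# ~100 TB of shuffle costs close to an hour per node even at full line
# rate - the scale-out limitation is often the network, not the CPU.
```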
12. 3 Pain Points in a Big Data DBMS
[Diagram: a Big Data System acted on by three rates - ingress data rate, egress data rate, and maintenance rate]
13. Big Data ingress challenges
● Parsing of data types
○ Strings
○ Dates
○ Floats
● Rate of data coming into the system
● ACID: Atomicity, Consistency, Isolation, Durability
● Compression on the fly (non-unique values)
● Amount of data
● On-the-fly analytics
● Time constraints (target: x rows per sec/hour; see the sketch after this list)
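As a sketch of how the last bullet might be checked, here is a framework-agnostic rate measurement; `insert_batch` stands in for whatever bulk-load path the product under test exposes (an assumption, not a real API):

```python
import time

def measure_ingest_rate(insert_batch, batches, target_rows_per_sec):
    """Drive batched inserts and check the sustained rate against a target.

    insert_batch: callable that loads one batch and returns rows loaded.
    This is a sketch; the real loader (COPY, bulk API, etc.) is whatever
    the product under test provides.
    """
    start = time.monotonic()
    total = 0
    for _ in range(batches):
        total += insert_batch()
    elapsed = time.monotonic() - start
    rate = total / elapsed
    assert rate >= target_rows_per_sec, (
        f"ingress too slow: {rate:,.0f} rows/s < {target_rows_per_sec:,.0f}")
    return rate

# Hypothetical usage: measure_ingest_rate(lambda: load_csv_chunk(path), 100, 1_000_000)
```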
14. Big Data egress challenges
● Sort
● Group by
● Reduce
● Join (sort-merge join)
● Data distribution
● Compression
● Theoretical Bottlenecks of hardware.
17. Product requirements?
● What are the product’s supported use cases?
● Supported Scale?
● Supported rate of ingress?
● Supported rate of egress?
● Complexity of Cluster?
● High availability?
18. The Automation: Must have requirements?
● Scale out/up?
● Fault tolerance!
● Reproducibility from scratch!
● Reporting!
● “Views” to avoid duplication for testing?
● Orchestration of same environment per Developer!
● Debugging options
● versioning!!!
19. The method
● Phase 0: get ready
○ Product requirements & business constraints
○ Automation requirements
○ Creating a testing matrix based on insight and product features.
● Phase 1: Get insights (some baselines)
○ Test Ingress separately, find your baseline.
○ Test Egress separately, find your baseline.
○ Test Ingress while Egress is in progress, find your baseline.
○ Stability and Performance
● Phase 2: Agile: Design an automation system that satisfies the requirements
○ Prioritize automation features based on your insights from phase 1.
○ Implement an automation infrastructure as soon as possible.
○ Update your testing matrix as you go (insight will keep coming)
○ Analyze the test results daily.
● Phase 3: Cost reduction
○ How to reduce compute time/ IO pattern/ network pattern/storage footprint
○ Maintenance time (build/package/deploy/monitor)
○ Hardware costs
21. So why build a DBMS from scratch?
● Most compilers are designed for the CPU → execution tree for a CPU runtime engine → a new compiler is needed, built with the GPU resource in mind.
● Most algorithms were written for the CPU → same algorithm logic, redesigned for the parallelism philosophy of the GPU → a new runtime with the GPU resource in mind → performance increases by an order of magnitude.
● VISION: fastest DBMS, cost effective, true scalability.
22. True story
● Product: big data GPU-based DBMS for analytics [peta scale]
○ Structured data only, designed for OLAP use cases only
● Technical constraints
○ How do you test SQL syntax (infinite input space)? SQL coverage from 0% to 20% to 90% to 100%?
○ What is the expected impact of big data?
○ How do you debug performance issues?
○ Can you virtualize? Should you virtualize?
○ Running time - how long does it take to run all the tests?
● Company challenges
○ Cost of hardware? High-end 2U servers
○ Expertise? Skill set?
○ Human resources: size of the QA automation team?
○ How do you manage automation?
○ What is your MVP? (chicken and egg)
○ TIME TO MARKET!!! The product needs to be released.
23. The baseline concept
● Requirements
○ CSV, DDL, query
○ Competing DB
○ Your DB
○ Synthetic data used mostly
○ (real data at peta scale is hard to come by)
● Steps
○ Insert the same data into the same table on both DBs
○ Run the same query on both DBs
○ Compare the results from both
○ If equal → test passes (a minimal sketch of this loop follows)
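A minimal sketch of that loop, assuming two Python DB-API connections (`ref_conn` to the competing reference DB, `dut_conn` to the DB under test); the real framework adds reporting, expected errors, and tuning:

```python
def execute(conn, sql):
    # DDL / data loading: run and commit, no result set expected
    conn.cursor().execute(sql)
    conn.commit()

def fetch(conn, sql):
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

def baseline_test(ref_conn, dut_conn, ddl, load_sql, query):
    for conn in (ref_conn, dut_conn):
        execute(conn, ddl)       # same table definition on both sides
        execute(conn, load_sql)  # same data on both sides
    expected = fetch(ref_conn, query)
    actual = fetch(dut_conn, query)
    # sort both sides: without an ORDER BY, row order is not guaranteed
    assert sorted(expected) == sorted(actual), "results differ -> test fails"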
24. The Solution: SQL testing framework.
● Lightweight python test suite per feature (see the example shape after this list)
○ DDL
○ Set of queries
○ JSON for data generation requirements
○ Data generation utilities (genUtil)
○ Test results report - in CSV format
○ Expected-error mechanism
○ Results comparison mechanism
○ Command line arguments for advanced testing/config/tuning/views
● Wrapper test suite
○ Groups a set of test suites by logic - e.g., daily
○ Aggregate reporting mechanism
● Scheduling, reporting, monitoring, deployment, alerting mechanisms
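As an illustration only, one plausible shape for a per-feature suite definition tying those pieces together; every field name here is invented, not the framework's actual schema:

```python
# Hypothetical shape of one per-feature test suite, matching the pieces the
# slide lists (DDL, queries, JSON data-gen spec). All names are invented.

SUITE = {
    "feature": "group_by",
    "ddl": "CREATE TABLE t (id INT, grp INT, v DOUBLE)",
    "datagen": {                # would be consumed by genUtil
        "rows": 1_000_000,
        "columns": {
            "id":  {"type": "int", "unique": True},
            "grp": {"type": "int", "range": [0, 99]},
            "v":   {"type": "double", "range": [0.0, 1.0]},
        },
        "seed": 42,             # fixed seed -> reproducible from scratch
    },
    "queries": [
        "SELECT grp, COUNT(*) FROM t GROUP BY grp",
        "SELECT grp, SUM(v) FROM t GROUP BY grp",
    ],
    "expected_errors": {},      # queries that must fail, with expected errors
}
```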
25. Testing concept of SQL syntax

                Simple   Aggregation   Joins
  Simple          x           x          x
  Aggregation     x           x          x
  Joins           x           x          x

(Each query class is tested in combination with every other class, e.g. an aggregation over a join, as enumerated in the sketch below.)
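A toy enumeration of that matrix, just to show the intent; the real suites contain hand-written queries per combination:

```python
import itertools

# Every query class is composed with every other, yielding test buckets
# like "aggregation over a join". Purely illustrative.
classes = ["simple", "aggregation", "join"]
for outer, inner in itertools.product(classes, classes):
    print(f"test bucket: {outer} over {inner}")
```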
26. Challenges with SQL syntax testing
● (Binary data not supported in the product)
● Repetition of queries.
● Different data type names.
● Different data ranges per data type.
● Different accuracy per data type.
● The competing DBs that support big data are expensive...
● Accuracy of results (different DBs return different accuracy; see the tolerance sketch below).
● SQL has some extreme cases (they differ per vendor).
● Datetime formats.
● Unsupported features.
● Duplication of testing.
● Very hard to predict which queries are useful (negative and positive testing).
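A sketch of a tolerant scalar comparison addressing the accuracy bullets; the tolerance value is an illustrative choice, not the framework's actual setting:

```python
import math

# Different engines return different float accuracy and numeric type names,
# so bit-exact equality is the wrong check for computed values.
def values_match(a, b, rel_tol=1e-6):
    if a is None or b is None:
        return a is b                     # NULL must match NULL exactly
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    return a == b
```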
27. Challenges with Small Data testing
● Generating random data (see the seeded-generation sketch after this list)
○ Reproducible every time → strict data ranges per test (length, numeric range, format, accuracy)
○ What is the amount of unique values? Different histograms create different bottlenecks:
■ Unique-value challenges:
● Per chunk?
● Per column?
■ Non-unique-value challenges:
● Strings?
○ Lengths
○ Compressible?
● Numeric?
○ Overflow?
○ Floating point?
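A sketch of seeded, reproducible generation in the spirit of genUtil; the function name and parameters are invented for illustration:

```python
import random

def gen_int_column(seed, n, lo, hi, unique_ratio=0.1):
    """Reproducible column generator: the same seed yields the same data
    on every run. unique_ratio shapes the histogram - few distinct values
    stress compression, many distinct values stress memory and joins."""
    rng = random.Random(seed)                 # local RNG, never the global one
    distinct = max(1, int(n * unique_ratio))  # few distinct values -> compressible
    pool = [rng.randint(lo, hi) for _ in range(distinct)]
    return [rng.choice(pool) for _ in range(n)]

# gen_int_column(seed=42, n=1000, lo=0, hi=10**6) is identical on every run.
```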
28. Testing matrix: data integrity on big data
● Feature columns: data (no null), compression, null data, lookup, partitions, JDBC/ODBC
● Row scales: 1 row, 1M, 10M, 100M, 1B, 10B, 100B, 1 trillion, 10 trillion rows
● Every feature column is exercised at every row scale.
29. Challenges with Big Data testing.
● Generation time of the data → genUtil with JSON for user input
● Reproducibility → genUtil again.
● Disk space for testing data → peta-scale product → peta-scale testing data
● A results-comparison mechanism on big data itself requires a big data solution → sort-merge join (SMJ)
● Network bottlenecks on a scale-out system...
● ORDER IS NOT GUARANTEED! (one workaround is sketched below)
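One way to compare unordered results without a big-data-sized sort, sketched under the assumption that row serialization is identical on both sides: hash each row and combine the hashes order-insensitively:

```python
import hashlib

def rowset_fingerprint(rows):
    """Order-insensitive fingerprint of a result set.

    Summing per-row hashes modulo 2**128 makes the fingerprint independent
    of row order, so two huge results can be compared in a streaming pass
    without sorting either side. Collisions are astronomically unlikely
    but not impossible - a deliberate trade-off in this sketch.
    """
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        total = (total + int.from_bytes(digest[:16], "big")) % (1 << 128)
    return total

# assert rowset_fingerprint(dut_rows) == rowset_fingerprint(ref_rows)
```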
30. Insights on the fly: User Defined Test Flow
● A series of queries in a specific order - crashed the system
● Solution: automation infrastructure over a group of queries: the User Defined Test Flow (its shape is sketched below).
● Good for: pre- and post-conditions, custom testing, system-perspective testing (partitions).
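A hypothetical shape for such a flow; the step names and queries are invented, and the point is only the fixed execution order with fail-fast semantics:

```python
# Named steps run in a fixed order; a failure aborts the flow immediately,
# since later steps depend on earlier ones.

FLOW = [
    ("setup",    "CREATE TABLE t (id INT, v INT)"),
    ("load",     "INSERT INTO t VALUES (1, 10), (2, 20)"),
    ("count",    "SELECT COUNT(*) FROM t"),
    ("agg",      "SELECT id, SUM(v) FROM t GROUP BY id"),
    ("teardown", "DROP TABLE t"),
]

def run_flow(conn, flow):
    for name, sql in flow:
        try:
            conn.cursor().execute(sql)
        except Exception as exc:
            raise RuntimeError(f"flow failed at step '{name}'") from exc
```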
31. The solution: some metrics
1. Extra-large sanity testing per commit (!)
2. Hourly testing on the latest commit (24x7)
a. 800 queries
b. Zero data
3. Daily test:
a. 400,000 queries
b. Zero data for 90%, 10% testing up to 1B records.
4. Weekly testing:
a. Big data testing, 10B rows and above.
5. Monthly testing:
a. Full regression per version.
32. The solution: Hardware perspective
● 50x desktops: OptiPlex 7040, i7 4 cores, 32GB RAM, 1x 4TB disk.
● 10x high-end servers: Dell R730/R720, 20 cores, 128GB RAM, 16x 1TB disks, Nvidia K40.
● DDN storage with 200TB+ GPFS
● Mellanox FDR switch for the storage
● 1Gbit switches for the desktop compute nodes.
34. Performance Challenges: app perspective
● Architecture bottlenecks [count(*), group by, compression, sort-merge join]
● What is the IO pattern?
○ OLTP vs. OLAP
○ Columnar or row based?
○ How big is your data?
○ READ % vs. WRITE %
○ Sequential? Random?
○ Temporary vs. permanent
○ Latency vs. throughput
○ Multi-threaded or single-threaded?
○ Power query? Or production?
○ Cold data? Hot data?
35. Performance Challenges: OPS perspective
● Metrics
○ What is the expected total RAM required per query?
○ OS RAM cache, CPU cache hits?
○ SWAP? Yes/no? How much?
○ OS metrics - open files, stack size, realtime priority?
● Theoretical limits?
○ Disk type selection - limitations, expected throughput
○ RAID controller, RAID type? Configuration? Caching?
○ File system used? DFS? PFS? Ext4? XFS?
○ Recommended file system block size? File system metadata?
○ Recommended file size on disk?
○ CPU selection - number of threads, cache levels, hyper-threading, silicon
○ RAM selection - and placement on the chassis
○ Network - the difference b/w 40Gbit and 25Gbit. What is NIC bonding good for?
○ PCI Express 3, 16 lanes. PCI switch.
36. Maintenance challenges: OPS Perspective
How to optimize your hardware selection? (Analytics on your testing data)
a. Compute intensive
i. GPU core intensive
ii. CPU core intensive
1. Frequency
2. Number of threads for concurrency
b. IO intensive?
i. SSD?
ii. Nearline SATA?
iii. Partitions?
c. Hardware uniformity - use the same hardware everywhere.
d. Get rid of weak links
37. Maintenance challenges: DevOps Perspective
● Lightweight - python code
○ Advantage → focus on a generic skill set (python coding)
○ Disadvantage → reinventing the wheel
● Fault tolerance - scale-out topology
○ Cheap desktops → quick recovery from an image when hardware crashes
■ Advantages →
● Cheap, low risk, pay as you go.
● Gets 90% of the job DONE!
■ Disadvantages
● The footprint is hell.
● Management of desktops - requires creativity
● Relies heavily on one automation feature: reproducibility from scratch
● Validation automation on the infrastructure & deployment mechanism.
38. Maintenance challenges: DevOps Perspective
● Infrastructure
○ Continuous integration: continuous build, continuous packaging, installer.
○ Continuous deployment: pre-flight check, remote machines (on site, private cloud, public cloud)
○ Continuous testing: sanity, hourly, daily, weekly
● Reporting and monitoring:
○ Extensive report server and analytics on current testing.
○ Monitoring: green/red status and real-time system metrics.
39. Maintenance challenges: Innovation Perspective
1. Innovation on data generation (using the GPU to generate data)
2. Innovation on competitor-DB running time (hashing)
3. Innovation on workload (one dispatcher, many compute nodes)
4. Innovation on saving testing data / compute time
a. Hashing
b. Caching
c. Smart data creation - no need to create a 100TB CSV to run a 100TB test.
d. Logs: test results were managed in the DBMS itself :)
5. File system: ZFS cluster (utilizing unused disk space, redundancy, deduplication)
6. Nvidia TK1 / GRID GPU → footprint!!!
7. “Virtualizing the GPU” → footprint!!!
a. Docker
b. rCUDA
8. Amazon P2 GPU instances did not exist at that time (and even now they are still very expensive)
40. Building the team Challenge
● Engineering skills required
■ Hardware: servers
■ IO (storage, disks, RAID, IO pattern)
■ File systems
■ Linux, CLI, CMD, installing, compiling, GIT
■ Basic network understanding
■ High availability, redundancy
■ HPC
● Coding: python.
● QA big data mindset
● DevOps mindset
● Automation mindset
● Systematic innovation methodologies
41. Lessons learned (insights)
● Very hard to guarantee 100% QA coverage
● Quality at a reasonable cost is a huge challenge
● Very easy to miss milestones with QA of big data; each mistake creates a huge ripple effect in timelines
● Main costs:
○ Employees: training them.
○ Running time: coding innovative algorithms inside the testing framework.
○ Maintenance: automation, monitoring, analyzing test results, environment setup
○ Hardware: innovative DevOps, innovative OPS, research
● Most of our innovation was spent on:
○ Doing more tests with fewer resources
○ Agile improvement of the backbone
○ Avoiding big data complexity problems in our own testing!
● Industry tools? Great, but not always enough. Know their philosophy...
● Sometimes you need to get your hands dirty
● Keep a fine balance between business and technology
42. Take away message:
● Testing simplifications:
○ Ingress only
○ Egress only
○ Ingress while egress
○ Testing matrix - useful on complex big data systems
● The QA automation team
○ Coding/scripting skills are a MUST
○ Engineering skills are a MUST
○ DevOps understanding is a MUST
○ Allocating time for innovation is a MUST
○ Allocating time for cost reduction is a MUST
● The Automation infrastructure
○ KISS
○ MVP
○ IS A PRODUCT by itself, with its own product manager and architect
43. Special Thanks...
● Citi Innovation Center
● Asaf Birenzvieg - Big Data & Data Science Israel meetup
● PayPal Risk and Data Solutions Group