This document provides an overview of various AWS big data services including Athena, Redshift Spectrum, EMR, and Hive. It discusses how Athena allows users to run SQL queries directly on data stored in S3 using Presto. Redshift Spectrum enables querying data in S3 using standard SQL from Amazon Redshift. EMR is a managed Hadoop framework that can run Hive, Spark, and other big data applications. Hive provides a SQL-like interface to query data stored in various formats like Parquet and ORC on distributed storage systems. The document demonstrates features and provides best practices for working with these AWS big data services.
7. Athena Demo
● Features
● Console: quick Demo - convert to columnar
● Bugs (annoying compiler errors)
8. Athena & Hive: Convert to Columnar example
https://amazon-aws-big-data-demystified.ninja/2018/05/15/converting-tpch-data-from-row-based-to-columnar-via-hive-or-sparksql-and-run-ad-hoc-queries-via-athena-on-columnar-data/
9. Behind the scenes
● Uses Presto for queries
○ Runs in memory
○ (see the Presto documentation)
● Uses Hive for
○ DDL functions
○ Complex data types
○ Saving temp results to disk
● Relies heavily on Parquet
○ Compression
○ Metadata for aggregations
11. Billing
● Canceled queries are not billed, even if they scanned data for an hour!
● Billing is based on compressed data scanned, not uncompressed - good for the end user.
● $5 per TB scanned
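To make the compressed-data point concrete, a rough worked example (the 4:1 compression ratio is an assumption for illustration only): 1 TB of raw CSV stored as ~250 GB of snappy-compressed Parquet is billed as 0.25 TB per full scan, i.e. 0.25 × $5 ≈ $1.25 per query instead of $5.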
12. Connection
● Web GUI
● JDBC, with wrappers for other languages
● http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
● QuickSight, SQL Workbench, etc.
13. SerDe
● SerDes are pre-installed
● Compression is supported for all formats
○ Parquet - Snappy is the default, you can change it (decompression is fast)
○ ORC - zlib compression
○ Apache web server logs - RegexSerDe
14. Parquet vs Text
● Parquet
○ Columnar
○ Schema segregated into the footer
● Text + gzip = not columnar, but compressed.
18. Athena - Summary
● Use Athena when you are getting started
○ Ad hoc queries
○ Cheap
○ Forces you to work with external tables all the way.
● If you fully understand how to work with Athena → you understand big data @ AWS
○ It will be very much the same in Hive
○ It will be very much the same in SparkSQL
23. Getting started
● Create a schema; make sure Spectrum is available in your region:
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::506754145427:role/mySpectrumRole'
create external database if not exists;
● Supported data types: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
● Bucket and cluster must be in the same region
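As a hedged sketch of the next step (the bucket, table, and column names here are hypothetical), once the external schema exists you can define a table over S3 data and query it with plain Redshift SQL:

-- Define an external table over Parquet files in S3
create external table spectrum.sales(
  sale_id integer,
  user_id integer,
  amount decimal(10,2),
  sale_date varchar(10))
stored as parquet
location 's3://my-spectrum-bucket/sales/';

-- Query it like any other table; the scan runs on Spectrum slices
select sale_date, count(*) as orders, sum(amount) as revenue
from spectrum.sales
group by sale_date;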
24. Partitions?
● Manually add each partition? http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
● Tip - use the Hive/Athena MSCK REPAIR TABLE command.
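A sketch of both options (table names and S3 paths are hypothetical): Spectrum wants each partition added explicitly, while Hive and Athena can discover partitions from the directory layout.

-- Redshift Spectrum (assumes the table was declared with "partitioned by (dt varchar(10))"):
alter table spectrum.sales_partitioned
add partition (dt='2018-05-01')
location 's3://my-spectrum-bucket/sales/dt=2018-05-01/';

-- Hive / Athena: discover all partitions already laid out as .../dt=.../
MSCK REPAIR TABLE sales_partitioned;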
25. AWS Redshift Spectrum Tips
● More cluster nodes = more Spectrum slices = more performance
● Smaller nodes = more concurrency
● Be sure to understand local vs. external tables
● Make sure you understand the data use case
● Redshift Spectrum doesn't support nested data types, such as STRUCT, ARRAY, and MAP → you need to customize a solution for this via Hive.
● Data type conversions: string → varchar, double → double precision → done.
● DO
○ GROUP BY clauses
○ Comparison conditions and pattern-matching conditions, such as LIKE
○ Aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX
○ String functions
○ Bring back a huge amount of data from S3 into Amazon Redshift to transform and process
● DO NOT: DISTINCT and ORDER BY.
○ Spectrum doesn't support DATE as a regular data type or the DATE transform function.
○ Be careful about putting large tables that frequently join together in Amazon S3.
○ CTAS is not supported from external table to external table, i.e. you cannot write to an external table - only read from it.
26. Redshift Spectrum Summary
● Spectrum →
○ Requires a Redshift cluster
○ External tables are READ ONLY! (no write)
● Work with Spectrum →
○ If you have a huge ad hoc query (aggregations)
○ If you want to move some data from Redshift to S3, and analyze it later.
28. EMR recap
● Hadoop architecture
○ Master
○ Core
○ Task
○ HDFS
○ YARN (containers)
○ Engine: MR, Tez
● Scale out / scale up
● Hadoop anti-pattern - joins.
● AWS Glue -
○ Shared metastore…
○ And more, but not the topic for today.
29. EMR DEMO
● Console - how to create a custom cluster
○ Show all tech options
○ Glue (Presto, Spark, Hive)
○ Config
■ Maximize resource allocation
■ Dynamic resource allocation
■ Config path
○ Bootstrap / steps
○ Uniform instance groups / instance fleets
○ Custom AMI
○ Roles
○ Show security
○ CLI to create a cluster
30. EMR
● Tips for creating a cheap cluster + performance
○ Auto scaling - based on
■ YARN available memory
■ Container pending ratio
○ Spots - bidding strategy
○ New instance groups
○ Task instances with auto scaling!
○ Same size task nodes as data nodes
31. EMR summary
● Use a custom cluster
○ Get to know: maximize resource allocation
○ Experiment with all the open source options (Hue, Zeppelin, Oozie, Ganglia)
● Use Glue to share the metastore
● Use task nodes (even without autoscaling, you can kill them with no impact)
● When you are ready
○ Spot instances
○ Auto scaling
33. Hive Agenda
● Console
○ Hive over Hue
○ Hive over CLI
○ Hive over JDBC
● Create external table, location S3, text
● Data types
● SerDe
● Create external table, location S3, Parquet
● JSON
● External table
● Convert to columnar with partitions - AWS example
● Insert overwrite + dynamic partitions
34. Hive is not...
● Not designed for online transaction processing (OLTP)
● Not a language for real-time queries and row-level updates
35. Hive is...
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides an SQL-like query language called HiveQL or HQL.
● Configuring the metastore means telling Hive where the database is stored.
37. Data Types
● Column types
a. int / bigint
b. Strings: char / varchar
c. Timestamps, dates
d. Decimals
e. Union: a set of several data types
● Literals
a. Floating point, decimal point, null
● Complex types
a. Arrays, structs, maps!
38. Supported file formats
● TEXTFILE (CSV, JSON)
● SEQUENCEFILE (sequence files act as a container to store small files)
○ Uncompressed key/value records
○ Record-compressed key/value records - only the 'values' are compressed
○ Block-compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
● ORC (recommended for Hive, local tables, and ACID transactions such as delete/update)
● Parquet (recommended for Spark and external tables)
39. SerDe
● SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.
● A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
● Supported:
● Avro (Hive 0.9.1 and later), http://avro4s-ui.landoop.com
● ORC (Hive 0.11 and later)
● RegEx
● Thrift
● Parquet (Hive 0.13 and later)
● CSV (Hive 0.14 and later)
● JsonSerDe (Hive 0.12 and later, in hcatalog-core)
● For Hive releases prior to 0.12, Amazon provides a JSON SerDe available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
40. File format summary
● CSV / JSON → text
● Smaller than block size → sequence file
● Analytics: ORC/Parquet, columnar based
● Avro → row based, used for write-intensive use cases
41. Create table as Parquet - local table
CREATE TABLE parquet_test (
id int,
str string,
mp MAP<STRING,STRING>,
lst ARRAY<STRING>,
strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;
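A quick hedged usage sketch for the complex columns above (the key, index, and partition values are hypothetical):

SELECT id,
       mp['some_key'],   -- map lookup by key
       lst[0],           -- first array element
       strct.a           -- struct field access
FROM parquet_test
WHERE part = 'p1';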
42. External Table
Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
2. Data needs to remain in the underlying location even after a DROP TABLE.
43. Convert to Columnar
http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
● Write to an S3 bucket - if you make a typo in the bucket name, the data will still be written. Just not in S3… :)
● Create an external table on the source data in the S3 bucket.
● You need to manage the partitions via MSCK, which identifies partitions that were added manually to the distributed file system.
● Create the target table on S3 as Parquet.
● Insert the data from source to destination (see the sketch after this list).
● Query the data on S3, as Parquet, directly from Hive.
● Think in files ==> not one file at a time.
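A hedged end-to-end sketch of the steps above (table names, columns, and S3 paths are hypothetical, and the DDL follows the linked AWS example only loosely):

-- Source: raw CSV already sitting in S3
CREATE EXTERNAL TABLE logs_raw (
  user_id string,
  event string,
  dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/raw/';

-- Target: partitioned Parquet, also external on S3
CREATE EXTERNAL TABLE logs_parquet (
  user_id string,
  event string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';

-- Convert: dynamic partitioning picks the partition from the last SELECT column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE logs_parquet PARTITION (dt)
SELECT user_id, event, dt FROM logs_raw;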
44. Json SerDe Example
https://github.com/rcongiu/Hive-JSON-Serde
CREATE TABLE json_test1 (
one boolean,
three array<string>,
two double,
four string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
https://github.com/quux00/hive-json-schema
How to add a SerDe to EMR Hive?
ADD JAR /home/hadoop/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar
http://stackoverflow.com/questions/26644351/cannot-validate-serde-org-openx-data-jsonserde-jsonserde
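A hedged usage sketch (assumes the table's backing file holds one JSON object per line, e.g. {"one":true,"three":["fox","quick"],"two":2.5,"four":"poodles"}):

-- The SerDe maps JSON keys to columns by name, regardless of order
SELECT one, three[1], two, four
FROM json_test1;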
45. Hive Schema From Json
https://github.com/quux00/hive-json-schema
How to add a SerDe to EMR Hive?
ADD JAR /home/hadoop/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar
How to get a schema from a JSON file:
java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar file.json my_table_name
http://stackoverflow.com/questions/26644351/cannot-validate-serde-org-openx-data-jsonserde-jsonserde
46. Example (schema from JSON)
https://amazon-aws-big-data-demystified.ninja/2018/05/17/getting-a-sql-schema-from-json/
50. Why Use ORC?
1. ORC has performance optimizations
2. ORC has transactions: delete/update
3. ORC has bucketing (indexing...)
4. ORC is supposed to be faster than Parquet
5. Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
6. Apache ORC might be better if your file structure is flattened.
7. When Hive queries ORC tables, GC is called about 10 times less frequently. Might be nothing for many projects, but might be crucial for others.
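To make point 2 concrete, a hedged sketch of a transactional ORC table (assumes a Hive setup with transactions enabled; the table and column names are hypothetical, and bucketing plus the 'transactional' property are what Hive's ACID support requires):

-- ACID tables must be bucketed, stored as ORC, and marked transactional
CREATE TABLE users_acid (
  user_id int,
  name string)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level updates and deletes, which plain Parquet tables don't allow
UPDATE users_acid SET name = 'anonymized' WHERE user_id = 1;
DELETE FROM users_acid WHERE user_id = 2;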
51. Why Use Parquet
Why not ORC: a couple of considerations for Parquet over ORC in Spark:
1. Easy creation of DataFrames in Spark; no need to specify schemas.
2. Works well on highly nested data.
3. Works well with Spark; Spark and Parquet are a good combination.
4. Also, ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many number columns, it doesn't compress as well. This affects both zlib and snappy compression.
52. Why Use ORC/Parquet
Confusing parts:
https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
● Hive has a vectorized ORC reader but no vectorized Parquet reader.
● Spark has a vectorized Parquet reader and no vectorized ORC reader.
● Spark performs best with Parquet, Hive performs best with ORC.
53. Hive Summary
● Understand the following concepts
○ Columnar
○ External tables
○ JSON parsing
○ ACID tables
○ ORC/Parquet
○ Lateral view + explode (see the sketch below)
● Bottom line, demystified
○ Work with external tables all the way!
○ Use Parquet (for future work with Spark…)
○ ACID → use ORC + Hive
○ Use a SerDe to parse raw data
○ Use dynamic partitions when possible (carefully)
○ Use Hive to convert data to what you need - INSERT OVERWRITE
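A hedged sketch of lateral view + explode, reusing the hypothetical json_test1 table from slide 44 - each element of the array column becomes its own row:

SELECT one, item
FROM json_test1
LATERAL VIEW explode(three) t AS item;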
54. Big Data SQL Performance Tips @ AWS | Lessons Learned
AWS Big Data Demystified
Omid Vahdaty, Big Data Ninja
55. Generally speaking, be familiar with:
1. When to normalize/denormalize
2. Columnar vs. row based
3. Storage types: Avro/Parquet/ORC
4. Compression
5. Complex data types
6. When to partition? How much?
7. Processing ints is faster than strings by roughly a factor of 10
8. What is the fastest DB in the world? [Hint: what is your use case?]
9. Network latency from client to server?
10. Encryption at rest / in motion = performance impact?
56. Tips for external / local tables: when to use which
Use local tables when:
1. You are using only an analytics DB such as Redshift, and you don't need a data lake
2. The data is small and temporary (inserts take time)
3. Performance is everything (be prepared to pay 5 to 10 times more on external)
4. You need to insert temporary results of your query and there is no option to write to an external table (Hive supports writing to an external table, but Athena doesn't)
Use external tables when:
1. Cost is an issue - use transient clusters
2. Your data is already on S3
3. You need several DBs for several use cases, and you want to avoid inserts/ETLs
4. You want to decouple compute power from storage power:
5. i.e. Athena & Spectrum - infinite compute, infinite storage; pay only for what you use.
57. Redshift vs. Redshift Spectrum vs. Hive vs. Athena
● Cost: Redshift - high | Spectrum - low | Hive - medium | Athena - low
● Performance: Redshift - top 10 in the world | Spectrum - fast enough... | Hive - slow... | Athena - fast enough...
● Syntax: Redshift - Postgres | Spectrum - Postgres | Hive - Hive | Athena - Presto
● Data type advantages: Redshift - no arrays | Spectrum - no arrays | Hive - complex data types | Athena - complex data types
● Storage type: Redshift - columnar | Spectrum - columnar | Hive - columnar and row | Athena - columnar and row
● Use case: Redshift - joins, traditional DBMS, analytics (joins, aggregations, order by) | Spectrum - aggregations only | Hive - transformation, advanced parsing, transient clusters | Athena - ad hoc querying, not for big data
● Anti-pattern: Redshift - temporary clusters | Spectrum - joins | Hive - joins, quick and dirty, simplicity | Athena - joins / big data / inserts
58. Performance Tips for Modeling
1. Choose the correct partition. [dt? win time?]
2. Big data anti-pattern - usage of joins... use flat tables whenever possible; easier to calculate.
3. Static raw data (one-time job as data enters) = precalculate what you need on a daily basis = storage is cheap... (see the sketch after this list)
a. Lookup tables - convert to int when possible, or even boolean if one exists. Don't use LIKE.
b. Datepart of win time = can you precalculate it into the fact table?
c. Minimize LIKEs
d. String to int/boolean/bit when possible.
e. CASE = can you precalculate it into the fact table?
f. COALESCE = can you precalculate it into the fact table?
g. Calculate the GROUP BY of index values (BlueKai/GAID) before the join job, in a separate job → reduces the running time of the join.
4. Dynamic data (recurring daily job): compute is expensive...
a. Filter data by the same time interval across all fact tables
b. Filter out rows not needed across all fact tables
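A hedged sketch of point 3 (the tables and columns are hypothetical) - doing the CASE/datepart work once, as data lands, instead of in every downstream query:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE fact_events PARTITION (dt)
SELECT user_id,
       CASE WHEN country = 'US' THEN 1 ELSE 0 END AS is_us,  -- string → int once, up front
       hour(win_time) AS win_hour,                           -- datepart precalculated
       dt
FROM raw_events;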
59. If you must join...
1. Notice the order of the tables in the join - from small to big.
2. Filter as much as possible
3. Use only the columns you must.
4. Use EXPLAIN to understand the query you are writing
5. Use EXPLAIN to minimize rows (small table X small table = maybe equals a big table)
6. Copy small tables to all data nodes (Redshift/Hive)
7. Use hints if possible.
8. Divide the job into smaller atomic steps
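A hedged sketch combining points 4, 6, and 7 in Hive (the tables are hypothetical; the MAPJOIN hint asks Hive to broadcast the small dimension table to all nodes):

EXPLAIN
SELECT /*+ MAPJOIN(d) */ f.user_id, d.country, count(*)
FROM fact_events f
JOIN dim_users d ON f.user_id = d.user_id
WHERE f.dt = '2018-05-01'   -- filter as early as possible
GROUP BY f.user_id, d.country;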
60. Tips to avoid joins
1. Use flat tables with many columns - storage is cheap
2. Use complex data types such as arrays, and nested arrays.
61. Hive tuning tips
1. Avoid ORDER BY if possible…
2. Minimize reducers.
3. Suggested config:
a. set hive.exec.parallel=true;
b. set hive.exec.parallel.thread.number=24;
c. set hive.tez.container.size=4092; (check this one carefully)
d. set hive.support.concurrency=true;
e. set hive.exec.compress.output=true;
f. set hive.exec.compress.intermediate=true;
g. set mapred.compress.map.output=true;
h. set hive.execution.engine=mr;
63. Performance Summary
● Partitions…
● External tables vs. local tables
● Flat tables + complex data types vs. joins
● Compression
● Columnar → Parquet
64. Lecture summary - starting with big data?
● Start with Athena
● Already have Redshift? Consider Spectrum
● Use EMR Hive to transform any structured or semi-structured data to Parquet
● Fully nested? Consider Avro
65. Complex Q&A from the audience - post-lecture notes
● When to use Redshift? And when to use EMR (Spark SQL, Hive, Presto)?
○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/when-should-we-emr-and-when-to-use-redshift/
● Cost reduction on Athena:
○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/cost-reduction-on-athena/
66. Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://amazon-aws-big-data-demystified.ninja/
● Join our meetup, FB group and YouTube channel
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/
○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber