1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce for image processing, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed HIP scales linearly and is at least 10x faster than the previous system, and using cascading downloads provided a 20% performance gain.
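Below is a minimal sketch, in Python with OpenCV, of the crop-and-resize stage described above. The file paths, crop box, and target dimensions are illustrative assumptions, not values taken from the HIP pipeline itself.

```python
# Minimal sketch of a crop-and-resize step for vehicle images (OpenCV).
# Paths and dimensions are hypothetical examples.
import cv2

def crop_and_resize(src_path, dst_path, crop_box, target_size=(640, 480)):
    """Crop a region from a vehicle image and resize it for an asset server."""
    image = cv2.imread(src_path)                 # load the raw image from staged storage
    if image is None:
        raise IOError(f"could not read {src_path}")
    x, y, w, h = crop_box                        # crop box: (x, y, width, height)
    cropped = image[y:y + h, x:x + w]            # rows first, then columns
    resized = cv2.resize(cropped, target_size, interpolation=cv2.INTER_AREA)
    cv2.imwrite(dst_path, resized)               # write the derivative image for publishing

# Example call (hypothetical paths and box):
# crop_and_resize("raw/vehicle_123.jpg", "out/vehicle_123_640x480.jpg", (100, 50, 800, 600))
```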
As part of a NoSQL series, I presented the Google Bigtable paper. In the presentation I tried to give a plain introduction to Hadoop, MapReduce, and HBase.
www.scalability.rs
Talk given on the state of NUMA with Java databases such as Cassandra, how it can be improved, and how it compares with traditional storage engines.
Software Defined Substation Intelligence, Automation and Control - Bastian Fischer
The Intelligent Digital Substation - Future Proof by Design
A combination of societal, technological, and environmental factors is transforming the energy industry in depth. The continuous increase of renewable and intermittent energy sources, the need to improve grid reliability and power quality, and regulatory pressure to reduce operating expenditures on grid assets all require investments today that remain future proof for decades to come.
Electrical grids are evolving in complexity, structure, and function to enable the bi-directional flow of energy, information, and transactions. The integration of distributed, intermittent energy resources requires constant network balancing, real-time adjustment of supply and demand, dynamic asset rating, dynamic protection schemes, and advanced automation, and is only possible with a new substation platform.
Electrical substations are the critical nodes of this grid evolution; to make the grid digital and intelligent, we first need to make substations digital and intelligent. SASensor is architected along Centralized Protection and Control principles and already delivers the benefits of a data-driven, software-defined implementation today. This makes your investment future proof, as new functions can be added through software upgrades over the entire life cycle of the substation.
SASensor is a substation platform that transforms your substations into intelligent hubs, providing a new level of functionality, applications, and performance, a new level of situational awareness, and high-resolution real-time data that gives insight into operation, diagnostics, and asset conditions.
SASensor provides a large set of protection, automation, communication, and measurement functions on a high-availability, redundant computing platform with efficient remote software, data, user, and configuration management, along with resilient cyber-security features.
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming - Michael Rainey
We produce quite a lot of data. Some of this data comes in the form of business transactions and is stored in a relational database. This relational data is often combined with other unstructured, high-volume, and rapidly changing datasets known in the industry as Big Data. The challenge for us as data integration professionals is to combine this data and transform it into something useful. Not just that, but we must also do it in near real time, using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution to that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution. This session will walk through the implementation of GoldenGate and Kafka.
Presented at Collaborate16 in Las Vegas.
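As a minimal sketch of the Kafka side of such a pipeline, the snippet below consumes change records that a GoldenGate Kafka handler might publish as JSON. The topic name, brokers, and message fields are assumptions for illustration, not GoldenGate's actual output format.

```python
# Consume (hypothetical) GoldenGate change records from a Kafka topic.
# Requires `pip install kafka-python`; topic and field names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "oggtopic",                                   # hypothetical topic written by the GoldenGate handler
    bootstrap_servers=["localhost:9092"],
    group_id="ogg-demo",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    change = message.value                        # one change record (insert/update/delete)
    print(change.get("table"), change.get("op_type"), change.get("after"))
```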
The tech talk was given by Ranjeeth Kathiresan, Salesforce Senior Software Engineer, and Gurpreet Multani, Salesforce Principal Software Engineer, in June 2017.
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in... - Confluent
Mark Teehan, Principal Solutions Engineer, Confluent
Use the Debezium CDC connector to capture database changes from a Postgres database - or MySQL or Oracle - streaming them into Kafka topics and onwards to an external data store. Examine how to set up this pipeline using Docker Compose and Confluent Cloud, and how to use various payload formats such as Avro, Protobuf, and JSON Schema.
https://www.meetup.com/Singapore-Kafka-Meetup/events/276822852/
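A minimal sketch of registering the Debezium Postgres connector against a Kafka Connect worker's REST API, roughly what the Docker Compose setup above automates. Hostnames, credentials, and the exact set of config keys are illustrative assumptions; check the Debezium documentation for the version you run.

```python
# Register a (hypothetical) Debezium Postgres connector via the Kafka Connect REST API.
import json
import requests

connector = {
    "name": "inventory-postgres-cdc",            # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "topic.prefix": "pg",                    # key names vary across Debezium versions
        "table.include.list": "public.customers",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",          # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```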
HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the biggest and most exciting milestone release from the Apache community since 1.0. HBase 2.0 contains a large number of features that have been a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, and shading of dependencies, as well as many other fixes and stability improvements. We will go into technical detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Existing users of HBase/Phoenix as well as operators managing HBase clusters will benefit the most, as they can learn about the new release and its long list of features. We will also briefly cover the earlier 1.x release lines, compatibility and upgrade paths for existing users, and conclude with an outlook on the next set of initiatives for the project.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest features and optimizations which have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will deep dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other database systems, implementing them in Hive poses some unique challenges and yields lessons that are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Performance of Microservice frameworks on different JVMs - Maarten Smeets
A lot is happening in the world of JVMs lately. Oracle changed its support policy roadmap for the Oracle JDK. GraalVM has been open sourced. AdoptOpenJDK provides binaries and is supported by (among others) Azul Systems, IBM, and Microsoft. Large software vendors provide their own supported OpenJDK distributions, such as Amazon (Corretto), Red Hat, and SAP. Next to OpenJDK there are also different JVM implementations such as Eclipse OpenJ9, Azul Systems Zing, and GraalVM (which allows the creation of native images). Other variables include the JDK version used and whether you run the JDK directly on the OS or within a container. In addition, JVMs support different garbage collection algorithms, which influence your application's behavior. There are many options for running your Java application, and choosing the right ones matters! Performance is often an important factor when choosing your JVM. How do the different JVMs compare with respect to performance when running different microservice implementations? Does a specific framework perform best on a specific JVM implementation? I've performed elaborate measurements of (among other things) start-up times, response times, CPU usage, memory usage, and garbage collection behavior for these different JVMs with several different frameworks such as reactive Spring Boot, regular Spring Boot, MicroProfile, Quarkus, Vert.x, and Akka. During this presentation I will describe the test setup used and show you some remarkable differences between the JVM implementations and microservice frameworks. Differences between running a JAR or a native image, and the effects of running inside a container, are also shown. This will help you choose the JVM with the right characteristics for your specific use case!
Vladimir Rodionov (Hortonworks)
Time-series applications (sensor data, application/system logging events, user interactions, etc.) present a new set of data storage challenges: very high velocity and very high volume of data. This talk will present the recent developments in Apache HBase that make it a good fit for time-series applications.
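A minimal sketch of a common HBase row-key pattern for time-series data: a salt bucket to spread writes across regions, plus a metric id and a fixed-width timestamp. This illustrates the kind of schema such workloads typically use; it is not taken from the talk itself. It assumes an HBase Thrift gateway, the happybase client, and a hypothetical table named "metrics".

```python
# Salted time-series row keys with happybase (`pip install happybase`).
import time
import zlib
import happybase

NUM_BUCKETS = 16

def row_key(metric, ts_seconds):
    bucket = zlib.crc32(metric.encode()) % NUM_BUCKETS       # stable salt to avoid region hotspotting
    return f"{bucket:02d}|{metric}|{int(ts_seconds):010d}".encode()

connection = happybase.Connection("localhost")               # HBase Thrift gateway
table = connection.table("metrics")                          # assumed pre-created with family 'd'

now = time.time()
table.put(row_key("sensor42.temperature", now), {b"d:value": b"21.7"})

# Scan one salt bucket for a metric over a time range (a full query fans out over all buckets).
start = row_key("sensor42.temperature", now - 3600)
stop = row_key("sensor42.temperature", now + 1)
for key, data in table.scan(row_start=start, row_stop=stop):
    print(key, data[b"d:value"])
```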
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir... - InfluxData
Dean will provide practical tips and techniques learned from helping hundreds of customers deploy InfluxDB and InfluxDB Enterprise. This includes hardware and architecture choices, schema design, configuration setup, and running queries.
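A minimal sketch of writing and querying points with the InfluxDB 1.x Python client, illustrating the schema-design advice above (tags for indexed dimensions, fields for measured values). The database, measurement, and tag names are hypothetical; install with `pip install influxdb`.

```python
# Write and query points with the InfluxDB 1.x Python client.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telemetry")
client.create_database("telemetry")              # idempotent if the database already exists

points = [{
    "measurement": "cpu_load",
    "tags": {"host": "server01", "region": "us-west"},   # tags are indexed: filter/group by them
    "fields": {"value": 0.64},                            # fields hold the actual samples
}]
client.write_points(points)

result = client.query(
    "SELECT mean(value) FROM cpu_load WHERE region = 'us-west' AND time > now() - 1h GROUP BY host"
)
for point in result.get_points():
    print(point)
```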
Apache HBase™ is the Hadoop database: a distributed, scalable, big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Terabyte-scale image similarity search: experience and best practice - Denis Shestakov
Slides for the talk given at IEEE BigData 2013, Santa Clara, USA on 07.10.2013. Full-text paper is available at http://goo.gl/WTJoxm
To cite please refer to http://dx.doi.org/10.1109/BigData.2013.6691637
The growth of the amount of medical image data produced on a daily basis in modern hospitals forces the adaptation of traditional medical image analysis and indexing approaches towards scalable solutions. In this work, MapReduce is used to speed up and make possible three large-scale medical image processing use cases: (i) parameter optimization for lung texture classification using support vector machines (SVM), (ii) content-based medical image indexing, and (iii) three-dimensional directional wavelet analysis for solid texture classification.
Our secure remote connectivity tool provides full video recording of all work our engineers perform on client systems. We have requirements to analyze the video log to detect suspicious activity and to provide forensic and root cause analysis capabilities. Some of the obvious use cases include detection of credit card patterns or personally identifiable information (PII) as well as malicious activity like dropping database objects. We need to process hundreds of gigabytes per day, representing thousands of hours of video. Our solution leverages a variety of Hadoop components to perform optical text recognition and indexing and keyboard and mouse movement analysis, as well as integration with a variety of other data sources such as our monitoring, documentation, ticketing, and communication systems. We will present our complete architecture, starting from multi-source data ingestion through data processing and analysis up to the end-user interface, reporting, and integration layer.
Big Data - The 5 Vs Everyone Must Know - Bernard Marr
This slide deck, by Big Data guru Bernard Marr, outlines the 5 Vs of big data. It describes in simple language what big data is, in terms of Volume, Velocity, Variety, Veracity and Value.
My idea of a new kind of reminder tool that shows reminders in a much less intrusive way and is intelligent enough not to disturb you when you are busy with tasks more important than the one it wants to remind you about.
SCAPE Information Day at BL - Large Scale Processing with Hadoop - SCAPE Project
This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced Hadoop and the way the British Library and SCAPE have used Hadoop to process large-scale data.
An Introduction to Big Data, Hadoop architecture, HDFS and MapReduce. Some concepts are explained through animation which is best viewed by downloading and opening in PowerPoint.
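As a concrete companion to the MapReduce concepts mentioned above, here is a minimal word-count example in Python for Hadoop Streaming, the classic illustration of the map and reduce phases. The script name and the way it is wired into the streaming jar are assumptions; exact paths depend on your distribution (e.g. `hadoop jar hadoop-streaming*.jar -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -input ... -output ...`).

```python
# wordcount.py - a minimal Hadoop Streaming mapper/reducer pair in one file.
import sys

def mapper():
    # Emit "<word>\t1" for every word on stdin; Hadoop sorts these by key before the reducer runs.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped by word; sum the counts per word.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()
```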
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin... - Cloudera, Inc.
Skybox Imaging is using Hadoop as the engine of its satellite image processing system. Using CDH to store and process vast quantities of raw satellite image data enables Skybox to create a system that scales as they launch larger numbers of ever more complex satellites. Skybox has developed a CDH-based framework that allows image processing specialists to develop complex processing algorithms using native code and then publish those algorithms into the highly scalable Hadoop MapReduce interface. This session will provide an overview of how we use HDFS, HBase, and MapReduce to process raw camera data into high-resolution satellite images.
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ... - Yahoo Developer Network
Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional and Analytical Platform). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes and short range scans (Apache HBase), and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with distributed Multi-Version Concurrency Control, which provides the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, and cost-based optimizer, and present the detailed execution of operational queries on HBase and of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben, is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
Generic presentation about Big Data Architecture/Components. This presentation was delivered by David Pilato and Tugdual Grall during JUG Summer Camp 2015 in La Rochelle, France
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses; including increased developer agility, and the ability to move applications between on-premises servers, cloud instances, and across data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front end UIs and simple, content-centric experiences) are often great candidates as stateless applications since HTTP is stateless by nature. There is no dependency on the local container storage for the stateless workload.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop and Spark, and to a lesser extent NoSQL and relational platforms such as Cassandra, MongoDB, Postgres, and MySQL, are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark - Alex Zeltov
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an R programming intro. It is about working with data and big data on MLS.
- How to scale R
- Work with R and Hadoop + Spark
- Demo of MLS on HDP/HDInsight server with RStudio
- How to operationalize deploying models using MLS web service operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering.
Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology and, most recently, in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Prior to joining Microsoft he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
Link to the full talk - https://youtu.be/2Rf5t2Eh6IQ
https://go.dok.community/slack
https://dok.community
ABSTRACT OF THE TALK
This talk will provide a high-level overview of Kubernetes, Helm charts and how they can be used to deploy Apache Druid clusters of any size.
We'll review how Kubernetes functionality enables resilience and self-healing, historical tiers through node group affinity, middle manager scaling through Kubernetes autoscaling to optimize ingestion capacity and some of the gotchas along the way.
BIO
Sergio Ferragut is a database veteran turned Developer Advocate at Imply. His experience includes 16 years at Teradata in professional services and engineering roles.
He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces.
Sergio has worked at multiple technology start-ups including APL and Splice Machine where he helped guide product design and field messaging.
SQL and Machine Learning on Hadoop using HAWQ - pivotalny
It is true, to the point of being almost rhetorical, to say:
“Many enterprises have adopted HDFS as the foundational layer for their data lakes. HDFS provides the flexibility to store any kind of data and, more importantly, it’s infinitely scalable on commodity hardware.”
But the conundrum to date has been finding a low-latency query engine for HDFS.
At Pivotal, we cracked that problem and the answer is HAWQ, which we intend to open source this year. During this event, we will present and demo HAWQ’s architecture, its powerful ANSI SQL features, and its ability to transcend traditional BI in the form of in-database analytics (or machine learning).
Similar to A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
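A minimal example in the spirit of the scikit-learn samples mentioned above: load a toy dataset, split it, train a classifier, and evaluate it. This is a generic illustration, not one of the workshop's actual CDSW notebooks.

```python
# Train and evaluate a simple classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=200)        # a simple supervised baseline
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```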
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
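A minimal sketch of querying a Phoenix table from Python through the Phoenix Query Server, similar in spirit to the Spring Boot microservice described above. The table and column names are hypothetical; install with `pip install phoenixdb` and point the URL at your PQS instance.

```python
# Upsert and query a (hypothetical) Phoenix crime table via the Phoenix Query Server.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# Phoenix uses UPSERT rather than INSERT.
cursor.execute(
    "UPSERT INTO PHILLY_CRIME (INCIDENT_ID, OFFENSE, DISTRICT) VALUES (?, ?, ?)",
    ("2019-0001", "THEFT", "09"),
)

# Read records back, e.g. for a front-end API.
cursor.execute(
    "SELECT INCIDENT_ID, OFFENSE, DISTRICT FROM PHILLY_CRIME WHERE DISTRICT = ? LIMIT 10",
    ("09",),
)
for row in cursor.fetchall():
    print(row)
```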
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be so trivial to design applications that make the most of it, nor the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and as its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of data into many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where that data should be written. This component is called Global Indexing. Without it, all records would be treated as inserts and re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle greater than 500 billion writes a day in our current ingestion systems, and it must provide strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
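A minimal sketch of the global-index lookup described above: for each incoming record key, consult an HBase index table to decide whether the record is an update (and where its existing file lives) or a brand-new insert. The table layout, column names, and the Thrift-based happybase client are illustrative assumptions, not Uber's actual implementation.

```python
# Global-index lookup sketch backed by HBase (`pip install happybase`).
import happybase

connection = happybase.Connection("localhost")            # HBase Thrift gateway
index = connection.table("record_index")                  # assumed: row key = record key, family 'loc'

def annotate(record_key, candidate_file):
    """Return (is_update, file_id) and keep the index current."""
    row = index.row(record_key.encode())
    existing = row.get(b"loc:file_id")
    if existing is not None:
        # Known record: route the change to the file that already holds it.
        return True, existing.decode()
    # New record: remember where we are about to write it.
    index.put(record_key.encode(), {b"loc:file_id": candidate_file.encode()})
    return False, candidate_file

print(annotate("trip:12345", "hdfs://warehouse/trips/part-0007"))
```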
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
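A minimal illustration of the "few lines of code" tracking workflow described above, using the MLflow Python API. The parameter, metric, and artifact here are placeholders, not part of the session's demo.

```python
# Log a parameter, a metric series, and an artifact to an MLflow run.
import mlflow

with mlflow.start_run(run_name="demo-run"):
    mlflow.log_param("learning_rate", 0.01)        # hyperparameters for this training run
    for step, loss in enumerate([0.9, 0.5, 0.3]):
        mlflow.log_metric("loss", loss, step=step) # metrics, logged per step

    with open("notes.txt", "w") as f:
        f.write("trained with the demo settings\n")
    mlflow.log_artifact("notes.txt")               # arbitrary files (plots, configs, etc.)
```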
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways a retail store of the near future could operate by attaching a deep learning system to a camera stream to identify various storefront situations: item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
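As a concrete taste of the object-detection basics mentioned above, the sketch below runs a pretrained MobileNet-SSD Caffe model on one camera frame with OpenCV's DNN module. The model files, camera index, and confidence threshold are assumptions for illustration; any SSD-style detector loadable by cv2.dnn would follow the same pattern.

```python
# Detect objects in a single camera frame with OpenCV's DNN module.
import cv2

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",   # hypothetical local model files
                               "MobileNetSSD_deploy.caffemodel")

capture = cv2.VideoCapture(0)                    # a store camera stream (index 0 = default webcam)
ok, frame = capture.read()
if not ok:
    raise RuntimeError("could not read a frame from the camera")

blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
net.setInput(blob)
detections = net.forward()                       # shape (1, 1, N, 7): [_, class_id, score, x1, y1, x2, y2]

h, w = frame.shape[:2]
for i in range(detections.shape[2]):
    score = float(detections[0, 0, i, 2])
    if score > 0.5:                              # keep confident detections only
        class_id = int(detections[0, 0, i, 1])
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
        print(f"class {class_id} at ({x1},{y1})-({x2},{y2}) score {score:.2f}")
capture.release()
```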
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark - DataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
- UI automation introduction
- UI automation sample
- Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I already have working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
NEWNTIDE, a leading brand in China's air energy industry, drives industry development with technological innovation, implementing national energy-saving and emission reduction policies. It pioneers an industry-focused multi-energy product line, adopting experiential marketing to meet diverse customer needs. The company has departments for R&D, marketing, operations, and sales, aiming to ultimately achieve "technological innovation, environmental friendliness, standardized management, and high-quality" as a high-tech enterprise integrating business and technical R&D, production, sales, and service.
NEWNTIDE boasts the most comprehensive support service network in the industry. Its earliest products cover 25 series, including split, integrated, wall-mounted, cabinet, and upright types, with over 100 diverse products. Commercial products include floor heating, air heaters, air conditioners for heating and cooling, oxidation and nitrogen air conditioners, and high-temperature heating. The products feature comprehensive intelligent technology management, cloud control technology, rapid heating technology, basic protection technology, remote control technology, DC inverter technology, and remote Wi-Fi smart control, achieving a leading position in the industry with SMART interactive technology.
For over a decade, the company has adhered to a "people-oriented" business philosophy, strictly implementing industry 7S management, ISO9001/ISO14001 quality and environmental systems, and industry standards to ensure stable product quality and meet customers' dual requirements for product safety and environmental protection.
Leading the development of intelligence with technological innovation, NEWNTIDE has become a national demonstration base for the transformation of scientific and technological achievements, awarded the "China Energy Saving Technology Contribution Award" and "China Energy Science and Technology Progress Award". The company adopts a strategy of high standards, high quality, and high-tech for key products, holding core technologies and competitive advantages. It also organizes multiple strategic support projects known as the "18 Key Operational Projects" and "18 Key Operational Strategies," driving technology project approvals with multidimensional strategic product quality modules and comprehensive practical operations to enhance the quality of all products.
Since its establishment, NEWNTIDE has always committed to providing high-quality and high-end intelligent heat pump products, serving billions of global families with the goal of creating a sustainable and prosperous environment. The development of NEWNTIDE has been supported by various levels of government and widely recognized and cooperated with by internationally renowned institutions, taking on a social responsibility of providing tranquility and happiness while enjoying the environment.
Let safe heat pumps be a necessity for a beautiful human life.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
4. Why Hadoop?
● High scalability
● Stores the historical data of images
● Fault tolerance
● Identifies updates to images based on the content of the URL
5. HIP Components
1. HBase: datastore for images and for archiving images
2. MapReduce: computation engine for the image processor
3. Kafka: publisher/subscriber for pushing images to the asset servers
4. OpenCV Java: image processing library
5. Avro: serialization library for storing data on HDFS
6. HBase Data Model
Tables:
1. IMAGE: stores the current set of images with some metadata
2. IMAGE_ARCHIVE: stores historical data of vehicles and the original images

7. HBase Data Model
Table: IMAGE
RowKey: <Vin_Number>

Column Family | Description | Versions
I | Stores all images of a vehicle; one image per column | 1
H | Stores metadata of all images | 1

Read patterns for "I" and "H" are mutually exclusive.
8. HBase Data Model
Table: IMAGE_ARCHIVE
RowKey: <Provider_id><Dealer_Id><vehicle_vin><Image_Index>

Column Family | Description | Versions
I | Stores the original images of the vehicle; only one column is stored | 10
A | Stores the fields of the Avro object of the vehicle and image, for analytics | 10
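To make the schema concrete, here is a minimal sketch of how the two tables above could be created with the HBase Java admin API (pre-1.0 style, matching the era of this deck). Only the table names, column families and version counts come from the slides; the split points and everything else are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class HipSchema {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // IMAGE: current images ("I") and their metadata ("H"), 1 version each
        HTableDescriptor image = new HTableDescriptor(TableName.valueOf("IMAGE"));
        image.addFamily(new HColumnDescriptor("I").setMaxVersions(1));
        image.addFamily(new HColumnDescriptor("H").setMaxVersions(1));

        // IMAGE_ARCHIVE: original images ("I") and Avro fields ("A"), 10 versions each
        HTableDescriptor archive = new HTableDescriptor(TableName.valueOf("IMAGE_ARCHIVE"));
        archive.addFamily(new HColumnDescriptor("I").setMaxVersions(10));
        archive.addFamily(new HColumnDescriptor("A").setMaxVersions(10));

        // Pre-split the tables; these split points are hypothetical
        byte[][] splits = { Bytes.toBytes("3"), Bytes.toBytes("6"), Bytes.toBytes("9") };
        admin.createTable(image, splits);
        admin.createTable(archive, splits);
        admin.close();
    }
}
```

Because the read patterns for "I" and "H" are mutually exclusive, a read can restrict itself to a single family (for example via Get.addFamily) so the other family is never touched.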
9. HBase Tuning
● Pre-split tables
● Keep column names short (2-8 letters)
● Region size of 8-10 GB
● Asynchronous clients should buffer Put operations (autoFlush=false); see the sketch after this list
● Disable periodic major compaction
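The autoFlush point is the one most easily shown in code. A minimal sketch, assuming the classic pre-1.0 HTable client API; the table and family names are the ones from the data model above, while the buffer size and the example VIN are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedImageWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "IMAGE");
        table.setAutoFlush(false);                   // buffer Puts on the client side
        table.setWriteBufferSize(8L * 1024 * 1024);  // flush roughly every 8 MB (assumed size)

        byte[] imageBytes = new byte[0];             // placeholder for real image data
        Put put = new Put(Bytes.toBytes("1FTEX1CM3BFA12345"));        // row key = VIN (example)
        put.add(Bytes.toBytes("I"), Bytes.toBytes("img_0"), imageBytes);
        table.put(put);                              // queued in the client write buffer

        table.flushCommits();                        // push whatever is still buffered
        table.close();
    }
}
```

The newer client expresses the same idea through BufferedMutator, but the effect is the same: many small Puts are batched into far fewer RPCs.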
11. CRUD in Reducer
Decision flow in the reducer:
● Is the record deleted? Yes → delete the row in HBase.
● No → is it an insert? Yes → download the images and generate the 6 sizes of the image.
● No → get the HTTP headers of the ImageURL and compare with the existing ones:
   ○ no header mismatch → do nothing;
   ○ header mismatch → 1. write to HBase, 2. write to Kafka.
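A sketch of that flow as reducer code. Everything here is hypothetical scaffolding: the value type, the helper method names, and the assumption that the insert path also ends in the HBase and Kafka writes are mine; the sketch only illustrates the branching, not the actual HIP implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: value type, helpers and write paths are assumed names.
public class ImageCrudReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text vin, Iterable<Text> events, Context context)
            throws IOException, InterruptedException {
        for (Text event : events) {
            if (isDeleted(event)) {
                deleteRowInHBase(vin);                          // vehicle no longer in the feed
            } else if (isInsert(event)) {
                byte[][] sizes = downloadAndGenerateSixSizes(event);
                writeToHBase(vin, sizes);                       // assumed: insert path also writes
                publishToKafka(vin, sizes);
            } else if (headersChanged(event)) {                 // HTTP headers differ from stored ones
                byte[][] sizes = downloadAndGenerateSixSizes(event);
                writeToHBase(vin, sizes);
                publishToKafka(vin, sizes);
            }
            // else: headers match what is already stored, do nothing
        }
    }

    // Stubs so the sketch compiles; the real logic lives elsewhere in the pipeline.
    private boolean isDeleted(Text e) { return false; }
    private boolean isInsert(Text e) { return false; }
    private boolean headersChanged(Text e) { return false; }
    private byte[][] downloadAndGenerateSixSizes(Text e) { return new byte[6][]; }
    private void deleteRowInHBase(Text vin) { }
    private void writeToHBase(Text vin, byte[][] sizes) { }
    private void publishToKafka(Text vin, byte[][] sizes) { }
}
```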
14. Kafka Producer Tuning

Property | Value | Default
request.required.acks | 1 | 0
message.send.max.retries | 30 | 3
retry.backoff.ms | 5000 | 100
client.id | HIP | ""

For the producer to sustain a node failure:
retry.backoff.ms × message.send.max.retries > ZooKeeper timeout (default 60000 ms).
With the defaults, 100 ms × 3 = 300 ms, so the producer gives up long before the ZooKeeper timeout expires. Failure recovery in 300 ms? Really?
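A sketch of those settings applied to the old producer API that these property names belong to (the Kafka 0.8-era kafka.javaapi.producer.Producer). The broker list, topic name, serializer settings and payload are assumptions added to make the snippet complete; the four tuned properties are the ones from the table above.

```java
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class HipKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");    // assumed brokers
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");  // byte[] payloads
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");

        // Values from the tuning table above
        props.put("request.required.acks", "1");       // wait for the leader to acknowledge
        props.put("message.send.max.retries", "30");   // 30 retries ...
        props.put("retry.backoff.ms", "5000");         // ... 5 s apart -> 150 s > ZK timeout
        props.put("client.id", "HIP");

        Producer<String, byte[]> producer = new Producer<>(new ProducerConfig(props));
        byte[] imageBytes = new byte[0];               // placeholder payload
        producer.send(new KeyedMessage<>("hip-images", "1FTEX1CM3BFA12345", imageBytes));
        producer.close();
    }
}
```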
15. Kafka Brokers Tuning

Property | Value | Default
log.retention.bytes | 24 GB | -1 (unlimited)
socket.send.buffer.bytes | 10485760 | 1048576
socket.receive.buffer.bytes | 10485760 | 1048576

1. Data is purged when either log.retention.bytes or log.retention.hours is exceeded.
2. log.retention.bytes = disk space / number of partitions on each node.
16. OpenCV
● Used the Java bindings of OpenCV to avoid having to use Hadoop Streaming
● The Java API is quite straightforward for encoding, decoding, cropping and resizing
Memory leak:
Mat.release() must be called to free the native memory used by a Mat.
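A minimal sketch of the decode → crop → resize → encode round trip with the OpenCV Java bindings (assuming the 3.x API; the helper class and the JPEG format choice are illustrative), including the explicit Mat.release() calls that avoid the leak mentioned above:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfByte;
import org.opencv.core.Rect;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class ImageResizer {
    static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); } // load the native OpenCV library

    public static byte[] cropAndResize(byte[] jpegBytes, Rect crop, Size target) {
        Mat decoded = Imgcodecs.imdecode(new MatOfByte(jpegBytes), Imgcodecs.IMREAD_COLOR);
        Mat cropped = new Mat(decoded, crop);        // ROI view, no pixel copy
        Mat resized = new Mat();
        Imgproc.resize(cropped, resized, target);
        MatOfByte encoded = new MatOfByte();
        Imgcodecs.imencode(".jpg", resized, encoded);
        byte[] result = encoded.toArray();

        // Mats wrap native memory that the Java GC does not track reliably;
        // release them explicitly to avoid the leak described above.
        decoded.release();
        cropped.release();
        resized.release();
        encoded.release();
        return result;
    }
}
```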