This talk was presented by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software projects and communities to the world.
The invited guests of this lecture are all from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to the Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and what the Apache Way is.
Practice of building the Apache ShardingSphere incubator community - jixuan1989
This talk was presented by Liang Zhang, a PPMC member of the Apache ShardingSphere (incubating) project, at the Apache Event at Tsinghua University in China.
Liang Zhang comes from JD.com.
Welcome!
Michael Stack, Software Engineer, Cloudera & HBase PMC Chair
9:00-9:05am
Conference MC Michael Stack, Chair of the HBaseCon 2013 Program Committee, welcomes you to the conference and offers a preview of the day.
The Apache HBase Community: Best Ever and Getting Better
Amr Awadallah, CTO and Co-founder, Cloudera
9:05-9:15am
Amr comments on the explosion of interest in Apache HBase over the past few years, how that interest has influenced the Hadoop stack overall, and why Cloudera considers its involvement in the HBase community to be so important.
State of the Apache HBase Union
Michael Stack & Lars Hofhansl, Architect, Salesforce.com
9:15-9:40am
Release-managers-in-crime Michael and Lars offer a look back, and a look forward, at HBase releases and what they have brought us (and will bring us in the future).
The Apache HBase Ecosystem
Aaron Kimball, Chief Architect, WibiData
9:40-10:05am
Today, HBase stands as Apache Hadoop did years ago, a project with a growing and vibrant community in its own right. In this talk, Aaron will overview some of the projects built on top of HBase that you’ll get a chance to learn about during the day – each of these projects having grown out of a need to use HBase for an application that requires real-time atomic access to data. As an example, he’ll present the motivations for Kiji and how it is helping organizations create amazing new applications using HBase and Hadoop.
Overview of Apache HBase at Facebook (Slides Not Available)
Liyin Tang, Software Engineer, Facebook & HBase PMC Member
10:05-10:30am
In this keynote, you’ll get an overview of how HBase is used at Facebook. Explore Facebook’s applications using HBase as an OLTP service, which require high reliability, efficiency, and scalability, and how HBase can tolerate small network glitches and rack failures. You’ll also learn the use cases for adopting HBase as a batch processing service and various optimizations to scale processing throughput. Finally, learn Facebook’s thoughts about the future of HBase.
Apache Bigtop has created the de facto standard for how Hadoop-based stacks are developed, delivered, and managed. We are at it again! The track will present the composition of a next-generation in-memory computing stack built entirely from open-source components. The next generation of the Apache data processing stack will focus on in-memory and transactional processing of large amounts of data. We will also talk about the performance benefits that legacy data-processing software based on MapReduce, Hive, and similar tools can derive from in-memory computing. This session will discuss and analyze the benefits of practicing Fast Data in the open.
Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of using Solr for data science activities and explore various use cases for Solr and data science.
Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
Slides from the Apache Spark Workshop by Big Data Trunk, a fun introduction to Apache Spark in the big data world.
www.BigDataTrunk.com
YouTube channel:
https://www.youtube.com/channel/UCp7pR7BJNnRueEuLSau0TzA
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the first public application deployed in the cloud, and allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Some of the innovative methods we used to convert XML-formatted data into usable information.
• How we parsed 5 TB of raw TIFF image data and converted it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third-party systems to build additional functionality on top of GPSN.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... - DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to bring advanced analytics, through machine learning and data science, to its users. Because the data science platform infrastructure integrated into Hadoop and oriented to streaming analytics applications is relatively immature, we have had to create the requisite platform components ourselves, using many of the pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it uses a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss, more generally, the full-stack data science tooling that has been created to enable data science at scale in an advanced streaming analytics application.
This session discusses the open-source community, its vital place within the AWS ecosystem, and how AWS works to provide seamless integration points. Our speakers share their experiences building and deploying cloud-based open-source projects while also reviewing some of today's most popular and relevant open-source platforms and solutions.
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC - Josh Baer
Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, challenges, and strategy (technical, organizational, people) for migrating the data and processing of Spotify's 2500-node Hadoop cluster to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
Chris Bradford & Matt Overstreet review several Cassandra use cases we’ve encountered in state and federal government. C* solves many big data problems when storing, enriching and improving access to data.
Spectator to Participant. Contributing to Cassandra (Patrick McFadin, DataSta... - DataStax
Feeling the need to contribute something to Apache Cassandra? Maybe you want to help guide the future of your favorite database? Get off the sidelines and get in the game! It's easy to say, but how do you even get started? I will outline some of the ways you can contribute to Apache Cassandra, from minor to major. If you don't have the time or ability to submit code, there are a lot of other ways you can participate. What if you do want to write some code? I can walk you through the process of creating a patch and submitting it for final approval. Got a great idea? I'll show you how to propose it to the community at large. Take it from me, participating is so much more fun than just watching the project from a distance. Time to jump in!
About the Speaker
Patrick McFadin, Chief Evangelist, DataStax
Patrick McFadin is one of the leading experts in Apache Cassandra and data modeling techniques. As the Chief Evangelist for Apache Cassandra and a consultant for DataStax, he has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/Developer for over 15 years.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Spark with Azure HDInsight - Tampa Bay Data Science - Adnan Masood, PhD
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
In this presentation we discuss Microsoft's HDInsight offering of Spark. Azure HDInsight is Microsoft’s managed Hadoop and Spark cloud service that runs the Hortonworks Data Platform. Spark for Azure HDInsight offers customers an enterprise-ready Spark solution that’s fully managed, secured, and highly available, and made simpler for users with compelling and interactive experiences.
A one-hour intro to search, Apache Lucene and Solr, and LucidWorks Search. Contains a quick start with LucidWorks Search and a demo using financial data (see GitHub project: http://bit.ly/lws-financial) as well as some basic vocabulary and search explanations.
Stackato presentation done at the Nordic Perl Workshop 2012 in Stockholm, Sweden
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
Basic application performance optimization techniques that can be applied to any application, from web to desktop or mobile, with a focus on the PHP/MySQL stack. Covers how to identify and resolve bottlenecks, and which strategies to choose to avoid them upfront.
Live presentation:
https://www.youtube.com/watch?v=aas8oM7CLjk
Know thy cost (or where performance problems lurk) - Oren Eini
Performance happens. Whether you've designed for it or not doesn’t matter; she is always invited to the party (and you had better find her in a good mood). Knowing the cost of every operation, and how it is distributed across every subsystem, will ensure that when you are building that proof-of-concept (that always ends up in production) or designing the latest enterprise-grade application, you will know where those pesky performance bugs like to live. In this session, we will go deep into the inner workings of every performance-sensitive subsystem, from the relative safety of the client to the binary world of Voron.
Data Science at Scale: Using Apache Spark for Data Science at Bitly - Sarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Intro to Machine Learning with H2O and AWS - Sri Ambati
Navdeep Gill @ Galvanize Seattle - May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
My Stackato presentation given to the CopenhagenJS user group. Basic examples were implemented in Node.
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions, and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
The 6th revision of my Stackato presentation, given at the German Perl Workshop 2013 in Berlin, Germany.
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
This is my presentation of ActiveState's Stackato, given to the Copenhagen Perl Mongers.
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution, and guaranteed low latencies (with SLAs), all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
This talk was presented by Junping Du, an Apache member and Hadoop PMC member, at the Apache Event at Tsinghua University in China.
Junping Du comes from Tencent and is the chairman of TOSA.
Willem Ning Jiang: Getting Started: How to join an Open Source project Apache... - jixuan1989
This talk was presented by Willem Ning Jiang, an Apache member and ServiceComb PMC member, at the Apache Event at Tsinghua University in China.
Willem Ning Jiang comes from Huawei.
About the Event:
The open source ecosystem plays more and more important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
This talk is introduce by Craig L Russell, who is the Apache Software Foundation Chairman, at Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays more and more important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
From a Student to an Apache Committer: Practice of Apache IoTDB
1. Open Source Practice of University Students at Apache,
as Seen from Apache IoTDB
Developing Apache IoTDB:
Practice Experience from Young Students
Xiangdong Huang
Tsinghua University, Beijing, China
2019.11.09
4. Who am I
• Xiangdong Huang (sainthxd@gmail.com)
• Was a PhD student and postdoc at Tsinghua University
• One of the initial committers of Apache IoTDB (incubating)
6. The Start
• The following story started when I knocked on the door of
my supervisor’s office in 2011…
My supervisor
(Jianmin Wang)
me
7. The Start
My supervisor
(Jianmin Wang)
me
Xiangdong, why do you
want to do a PhD at the
School of Software?
I want to develop
something that is used
by millions of people!
Come on!
Build some cool software that can be used by many, many people.
9. As an Individual Developer
• Wrote a lot of small “tools”
• But no maintenance
• Just for fun/self-use
10. Developer as a Student
• Many courses
• No need to write much code (in some homework assignments)
• Good for improving skills, but hard to get full marks (because some are really hard!).
Data Mining Modern Database
100 lines? innovation
11. Developer as a Student
The figure is from the Internet… (the image is unrelated to the text)
Homework magic
weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done
12. Developer as a Student
The figure is from the Internet… (the image is unrelated to the text)
Homework magic
weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done
To use the
demo, we can
Step 1, click..
Step 2, click..
…
student
reviews
13. Developer as a Student
The figure is from the Internet… (the image is unrelated to the text)
Homework magic
weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done
To use the
demo, we can
Step 1, click..
Step 2, click..
…
What if I click
here first.
14. Developer as a Student
The figure is from the Internet… (the image is unrelated to the text)
Homework magic
weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done
To use the
demo, we can
Step 1, click..
Step 2, click..
…
STOP!
YOU
CANNOT!
What if I click
here first.
15. We keep writing demo after demo after demo…
• Complex project management?
• Makefile? POM? Gradle?
• Agile? Scrum? Sprint?
• CI? CD?
A pom file example
From Apache PLC4X
16. At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
Please
implement some
functions
Ah, Hadoop + Hive
can do that!
Let me download it
17. At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
• ~200k lines of code
Please
implement some
functions
Ah, Hadoop + Hive
can do that!
Let me download it
Oops, an
exception!
18. At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
• ~200k lines of code
• 2.2.0, 2.2.1, …2.2.5;
Please
implement some
functions
Ah, Hadoop + Hive
can do that!
Let me download it
Oops, an
exception!
Why can
Cassandra
release updates
so frequently?
19. At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
• ~200k lines of code
• 2.2.0, 2.2.1, …2.2.5;
• Patch
Please
implement some
functions
Ah, Hadoop + Hive
can do that!
Let me download it
Oops, an
exception!
Why can
Cassandra
release updates
so frequently?
Wow, someone
shared a patch
file to fix a bug!
Yes, you are growing! Now you know JIRA, etc.
20. • When can I stop writing demos and build some
nice software like Apache Cassandra, Hadoop, etc.?
22. A New Hope
• Be active in an existing open source community
• Hadoop, Cassandra, Spark etc..
• Be active in a new open source community
• IoTDB etc..
24. A good DB can improve the whole process
Network
MQ Database
query insertion
save data
locally
Network
analysis
25. And no good software
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries
  • Lacks time filtering
  • Lacks value filtering
  • Lacks multiple time series alignment
TSDB
Based on PG
• Auto sharding
• Query optimization
• Performance degrades sharply after writing data for a long time
HBase/Cassandra based
• Partition by TS-UID and time range
• Storage inefficiency
• Limited queries
LSM based
• Efficient file structure
• More query functions
• Not optimized for some application scenarios
27. 1. Teamwork
• Git with a 10+ person team
• Commit log
• Conflict, merge, squash…
• Branches…(dev, release, stable…)
Let your software grow to >= 100K lines.
28. 2. Learn skills
• Git with a 10+ person team
• Conflict, merge, squash…
• Branches…(dev, release, stable…)
• Project structure
Make your software powerful.
29. 3. Stability/Agile
• Git with a 10+ person team
• Conflict, merge, squash…
• Branches…(dev, release, stable…)
• Project structure
• CI/CD
• Jenkins, Travis CI
Make your software something that can really be used.
30. 4. Open your mind
• Git with a 10+ person team
• Conflict, merge, squash…
• Branches…(dev, release, stable…)
• Project structure
• CI/CD
• Jenkins, Travis CI
• Issue -> PR -> Release
Open your mind.
Improve your communication skills.
31. 5. Research and Project
• User requirements -> Implementation -> IoTDB -> User
• Idea -> Implementation -> IoTDB -> Evaluation -> Paper -> User
• Paper -> Implementation -> IoTDB -> Evaluation -> User
32. OK….
• Past
• I can write a demo
• I like to write something
• I like to write something used
by myself
• Now
• I/We know how to write
complex software
• I/We know how to write
software used by people
33. Do it ourselves
• Know a lot about how Apache projects are developed!
• How the website of an Apache project is built?
• Who can be a committer of an Apache project?
• How to release projects?
• Who decides the new features of an Apache project?
• Etc..
34. Time Series DB for Industrial Internet
“清华数为” time series database --> Apache IoTDB (incubating)
• Apache IoTDB (incubating) is a
highly efficient database for
managing time series data,
especially in Industrial Internet
applications.
• A young community. Donated by
Tsinghua University; entered the
incubator on 2018-11-18.
• Devoted to building the best time
series database (in the IoT area) in the
world.
• Apache IoTDB v0.8.1 is released!
v0.9.0 is coming!
36. Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• is represented as a path that begins with root, like
“root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources
(threads and files) to increase parallelism and
reduce lock contention.
Cadillac XT5
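As a rough illustration of the "Device + Measurement" idea above, the sketch below splits a full time series path into its device part and its measurement part. The helper name split_series_path is hypothetical and is not part of IoTDB; it only assumes that the last path segment is the measurement and that every path begins with root.

```python
from typing import Tuple


def split_series_path(path: str) -> Tuple[str, str]:
    """Split a full time series path into (device path, measurement name).

    Hypothetical helper for illustration only; it is not an IoTDB API.
    Assumes the last dot-separated segment is the measurement.
    """
    if not path.startswith("root."):
        raise ValueError("a time series path must begin with 'root'")
    device, _, measurement = path.rpartition(".")
    return device, measurement


device, measurement = split_series_path(
    "root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain"
)
print(device)       # the device (data source) part of the path
print(measurement)  # the measurement (sensor) part of the path
```

Under this reading, all time series of one car share the same device prefix and differ only in the final segment.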
38. SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) FROM root.ln.wf01.wt01;
SELECT count(status) FROM root.ln.wf01.wt01 GROUP BY(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
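To make the "group by time interval" query more concrete, here is a minimal Python sketch of what counting points per fixed-width time bucket could look like. It is not IoTDB's implementation: the function name count_by_interval, the half-open range [start, end) and the alignment of buckets to start are all assumptions made for illustration, and timestamps are plain integers (for example seconds).

```python
from collections import OrderedDict


def count_by_interval(points, start, end, width):
    """Count points per fixed-width time bucket in [start, end).

    points: iterable of (timestamp, value) pairs.
    Buckets are aligned to `start`; this alignment is an assumption.
    """
    buckets = OrderedDict((t, 0) for t in range(start, end, width))
    for timestamp, _value in points:
        if start <= timestamp < end:
            bucket_start = start + (timestamp - start) // width * width
            buckets[bucket_start] += 1
    return buckets


# Four status readings; the last one falls outside the queried range.
points = [(0, True), (10, False), (3600, True), (7300, True)]
hourly = count_by_interval(points, start=0, end=7200, width=3600)
print(dict(hourly))
```

Empty buckets are kept with a count of zero, so the result has one row per interval, as a group-by over a time range usually does.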
39. Supported data type
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management
40.
41. TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info Natively
- Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
Detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
42. TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• Unified support of fixed frequency time series
or irregular frequency time series
• Fixed frequency: 128, 136, 144, 152, 160, …
  1st difference: 8, 8, 8, 8 (constant)
  2nd difference: 0, 0, 0 (1-bit storage needed!)
• Irregular frequency: 128, 135, 143, 154, 163, …
  1st difference: 7, 8, 11, 9 (not constant)
  2nd difference: 1, 3, -2 (2-bit storage needed!)
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
RLE encoding – repeated Int or Long
• For repeated sequence: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with variable length encoding scheme
Compression Algorithm
• Snappy, Gzip (TODO), LZO (TODO)
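The first and second differences on this slide can be reproduced with a few lines of Python. The sketch below only illustrates second-order differencing and shows that it is lossless; the function names ts2diff_encode and ts2diff_decode are hypothetical, and the actual TsFile encoder (including how the differences are packed into bits) is not shown here.

```python
def ts2diff_encode(timestamps):
    """Return (first value, first delta, list of second-order deltas).

    Illustrative only; requires at least two timestamps.
    """
    first = [b - a for a, b in zip(timestamps, timestamps[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return timestamps[0], first[0], second


def ts2diff_decode(t0, d0, second):
    """Rebuild the original timestamps from the encoded triple."""
    out = [t0, t0 + d0]
    delta = d0
    for dd in second:
        delta += dd                  # recover the next first-order delta
        out.append(out[-1] + delta)  # recover the next timestamp
    return out


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
print(ts2diff_encode(regular))    # second-order deltas are all zero
print(ts2diff_encode(irregular))  # second-order deltas are small numbers
assert ts2diff_decode(*ts2diff_encode(irregular)) == irregular
```

For a fixed frequency series every second-order delta is zero, which is why so little storage is needed per point; for an irregular series the deltas stay small as long as the sampling interval drifts only slightly.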
43. Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Storing second-order delta values and a boolean flag array.
TsFile: Encoding and Compression
44. Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
• Split the pattern and data stream into
equal length fragments
• Extract features to reduce the dimension
• Accelerate the search by using features
Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
• Indexing data using Key-Value form
Scenarios:
• Outlier detection
• Historical data analysis
• …
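The three pattern matching steps (equal length fragments, features, feature-based pruning) can be sketched as follows. The slide does not say which features are extracted, so this example assumes the mean of each fragment, a choice that gives a lower bound on the Euclidean distance and therefore never discards a true match. All names are hypothetical, the pattern length is assumed to be divisible by the number of fragments, and features are recomputed per window for clarity rather than updated incrementally as a streaming implementation would do.

```python
def fragment_means(series, n_fragments):
    """Feature vector: the mean of each equal length fragment."""
    size = len(series) // n_fragments
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(n_fragments)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_matches(stream, pattern, n_fragments, threshold):
    """Return start offsets where the window is within `threshold`
    (Euclidean distance) of the pattern."""
    m = len(pattern)
    size = m // n_fragments
    pattern_features = fragment_means(pattern, n_fragments)
    limit = threshold ** 2
    matches = []
    for start in range(len(stream) - m + 1):
        window = stream[start:start + m]
        # Cheap check on the low-dimensional features first.
        bound = size * squared_distance(
            fragment_means(window, n_fragments), pattern_features)
        if bound > limit:
            continue  # the full distance can only be larger
        if squared_distance(window, pattern) <= limit:
            matches.append(start)
    return matches


stream = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0]
print(find_matches(stream, pattern=[1, 2, 3, 2], n_fragments=2, threshold=0.1))
```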
45. From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB)
A zip file of time series
• Time series data files: high-tech
write, high compression ratio,
support simple queries. Simply
put, TsFile is a zip file for time
series data.
• Suitable for embedded devices,
general servers, data centers, etc.
IoTDB
A database of time series
• Freely operate time series of
multiple TsFiles, including CRUD
and advanced queries like max, min,
avg and temporal alignment.
• Scenario: embedded equipment, on-
site industrial computers, general
servers, etc.
3rd Systems
A data warehouse of time series
• Easy to use and integrate for
complex analysis (data fusion,
collaborative recommendation,
machine learning)
• Scenario: cloud data center
46. A Process to Manage Time Series Data
data source
or
JDBC / Session API
JDBC / Session API
Grafana-Adaptor, Spark-TsFile-Adaptor, JDBC
Analysis with Big Data Framework
(big data set)
Analysis with Matlab
(small data set)
Visualization
(manual data exploration)
https://github.com/jixuan1989/iotdb-tutorial
47. Latest version v0.8 (0.9.0-snapshot)
Apache IoTDB-incubating v0.9.0-SNAPSHOT
Hardware: Xeon E5v4, 256G Mem, HDD Disk
Workload | #Client | #Storage Group | #Device | #Measurement per Device | DataType | Encoding | Compression | BatchSize | #Point per Time Series
Insertion | 10 | 50 | 1000 | 100 | Float | RLE | Snappy | 100 | 100000
Query | 50 | 1 | 1 | 10 | Float | RLE | Snappy | 100 | 100000000
49. Write Performance: points/s(single node)
Xeon E5v4
256G Mem
HDD Disk
* In this experiment, we do not use IoTDB’s JDBC API and SQL interface.
Instead, we use a raw API like Cassandra’s Raw Thrift API.
Apache IoTDB-incubating v0.9.0-SNAPSHOT
50. Query Performance: aggregation count()
InfluxDB failed to return
any answers in the
100,000,000 setting.
Xeon E5v4
256G Mem
HDD Disk
Apache IoTDB-incubating v0.9.0-SNAPSHOT
51. Shanghai METRO Monitoring
• Original deployment: 144 trains, 3200 points/500 ms/train,
9 KairosDB + Cassandra
• After the upgrade: 300 trains, 3200 points/200 ms/train,
ONE IoTDB instance
• 14 KDB-compatible RESTful services, just for avoiding
modifying current programs
• 414 billion data points per day, just using ONE IoTDB instance
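The "414 billion data points per day" figure follows from the numbers on this slide, assuming that "3200 points/200 ms/train" means every train reports 3200 points every 200 ms around the clock. A small worked check (the function name points_per_day is only for illustration):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def points_per_day(trains, points_per_report, report_interval_ms):
    """Daily point count if every train reports continuously."""
    return trains * points_per_report * MS_PER_DAY // report_interval_ms


before = points_per_day(trains=144, points_per_report=3200, report_interval_ms=500)
after = points_per_day(trains=300, points_per_report=3200, report_interval_ms=200)
print(before)  # 79626240000, about 80 billion points per day
print(after)   # 414720000000, about 414 billion points per day
```

Under the same assumption, the upgraded workload is roughly 5.2 times the original one, or 4.8 million points per second.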