My First Data Science Project (using Rapid Miner)
For Data Science Thailand Meetup #2
datascienceth.com
facebook.com/datascienceth
Dr. Eakasit Pacharawongsakda
My First Data Science Project (using Rapid Miner)
For Data Science Thailand Meetup #2
datascienceth.com
facebook.com/datascienceth
Dr. Eakasit Pacharawongsakda
Data Architecture Best Practices for Today’s Rapidly Changing Data LandscapeDATAVERSITY
With the rise of the data-driven organization, the pace of innovation in data-centric technologies has been tremendous. New tools and techniques are emerging at an exponential rate, and it is difficult to keep track of the array of technological choices available to today’s data management professional.
At the same time, core fundamentals such as data quality and metadata management remain critical in order for organizations to obtain true business value from their data. This webinar will help demystify the options available: from data lake to data warehouse, to graph database, to NoSQL, and more, and how to integrate these new technologies with core architectural fundamentals that will help your organization benefit from the quick wins that are possible from these exciting technologies, while at the same time build a longer-term sustainable architecture that will support the inevitable change that will continue in the industry.
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various way. In this talk, I briefly summarized the evolution history of Apache Spark in this area and four main use cases and the benefits and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandardalready and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
This talk will break down merge in Delta Lake—what is actually happening under the hood—and then explain about how you can optimize a merge. There are even some code snippet and sample configs that will be shared.
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...Medicaps University
Data warehousing Components –Building a Data warehouse,
Need for data warehousing,
Basic elements of data warehousing,
Data Mart,
Data Extraction, Clean-up, and Transformation Tools –Metadata,
Star, Snow flake and Galaxy Schemas for Multidimensional databases,
Fact and dimension data,
Partitioning Strategy-Horizontal and Vertical Partitioning.
The document provides an overview of DataOps and continuous integration/continuous delivery (CI/CD) practices for data management. It discusses:
- DevOps principles like automation, collaboration and agility can be applied to data management through a DataOps approach.
- CI/CD practices allow for data products and analytics to be developed, tested and released continuously through an automated pipeline. This includes orchestration of the data pipeline, testing, and monitoring.
- Adopting a DataOps approach with CI/CD enables faster delivery of data and analytics, more efficient and compliant data pipelines, improved productivity, and better business outcomes through data-driven decisions.
This slides present concept of Data Mining and Big Data Analytics. The topices are:
- Internet of Things (IoT)
- Data Science/Mining applications
- Data Science/Mining techniques including (1) Association, (2) Clustering, (3) Classification
- CRISP-DM: Cross Industry Standard Process for Data Mining
Data Architecture Best Practices for Today’s Rapidly Changing Data LandscapeDATAVERSITY
With the rise of the data-driven organization, the pace of innovation in data-centric technologies has been tremendous. New tools and techniques are emerging at an exponential rate, and it is difficult to keep track of the array of technological choices available to today’s data management professional.
At the same time, core fundamentals such as data quality and metadata management remain critical in order for organizations to obtain true business value from their data. This webinar will help demystify the options available: from data lake to data warehouse, to graph database, to NoSQL, and more, and how to integrate these new technologies with core architectural fundamentals that will help your organization benefit from the quick wins that are possible from these exciting technologies, while at the same time build a longer-term sustainable architecture that will support the inevitable change that will continue in the industry.
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various way. In this talk, I briefly summarized the evolution history of Apache Spark in this area and four main use cases and the benefits and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandardalready and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
This talk will break down merge in Delta Lake—what is actually happening under the hood—and then explain about how you can optimize a merge. There are even some code snippet and sample configs that will be shared.
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...Medicaps University
Data warehousing Components –Building a Data warehouse,
Need for data warehousing,
Basic elements of data warehousing,
Data Mart,
Data Extraction, Clean-up, and Transformation Tools –Metadata,
Star, Snow flake and Galaxy Schemas for Multidimensional databases,
Fact and dimension data,
Partitioning Strategy-Horizontal and Vertical Partitioning.
The document provides an overview of DataOps and continuous integration/continuous delivery (CI/CD) practices for data management. It discusses:
- DevOps principles like automation, collaboration and agility can be applied to data management through a DataOps approach.
- CI/CD practices allow for data products and analytics to be developed, tested and released continuously through an automated pipeline. This includes orchestration of the data pipeline, testing, and monitoring.
- Adopting a DataOps approach with CI/CD enables faster delivery of data and analytics, more efficient and compliant data pipelines, improved productivity, and better business outcomes through data-driven decisions.
This slides present concept of Data Mining and Big Data Analytics. The topices are:
- Internet of Things (IoT)
- Data Science/Mining applications
- Data Science/Mining techniques including (1) Association, (2) Clustering, (3) Classification
- CRISP-DM: Cross Industry Standard Process for Data Mining
Introduction to big data and analytic eakasit patcharawongsakdaBAINIDA
Introduction to big data and analytic Eakasit Patcharawongsakda ในงาน THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE จัดโดย คณะสถิติประยุกต์และ DATA SCIENCES THAILAND
This slide present Data Analytics concept. Topics are level of analytics, CRISP-DM, data science use cases e.g., customer segmentation, churn prediction, product recommendation, demand forecasting
Slide บรรยายพิเศษ หัวข้อเครื่องมือและเทคโนโลยีสำหรับสร้างธุรกิจอัจฉริยะ (Internet of Things for Business)
โครงการสัมมนาทางวิชาการการสร้างนวัตกรรมใหม่สำหรับธุรกิจด้วย Internet of Things สาขาวิชาระบบสารสนเทศและคอมพิวเตอร์ธุรกิจ คณะบริหารธุรกิจและเทคโนโลยีสารสนเทศ มหาวิทยาลัยเทคโนโลยีราชมงคลสุวรรณภูมิ ศูนย์หันตรา
26 มีนาคม 2562
This document provides an introduction and overview of big data technologies. It begins with an outline that covers introductions to big data, NoSQL databases, MapReduce and Hadoop, and Hive, HBase and Sqoop. It then discusses relational databases and SQL before introducing NoSQL databases. Key reasons for using NoSQL databases are explained, including improved scalability, lower costs, flexibility in data structures, and high availability. Examples of big data applications and the internet of things are also presented.
This presentation described Big Data concept. Then it shows example of applications in Banking. The presenter is Dr. Tuangtong Wattarujeekrit in Big Data Analytics Day event.
This document discusses using predictive analytics for retail businesses. It outlines using store clustering and RFM (recency, frequency, monetary) analysis to develop predictive models. Store clusters would be used to design customized planograms and segmentation strategies. RFM would analyze active and expired customers to develop targeted strategies like offering discounts or new products to high value customers or win back lower value, expired customers. The overall goal is to use predictive models to improve planogram performance, customer retention and reactivation, and sales.
1. ดร.เอกสิทธิ์ พัชรวงศ์ศักดา
ผู้ร่วมก่อตั้งห้างหุ้นส่วนสามัญดาต้า คิวบ์ และ
ผู้อำนวยการหลักสูตรวิศวกรรมข้อมูลขนาดใหญ่ (Big Data Engineering)
วิทยาลัยนวัตกรรมด้านเทคโนโลยีและวิศวกรรมศาสตร์ มหาวิทยาลัยธุรกิจบัณฑิตย์
eakasit@datacubeth.ai
Automation Expo 2019
วันพฤหัสบดีที่ 14 มีนาคม 2562
First Step to Big Data
2. http://dataminingtrend.com http://facebook.com/datacube.th
About us
• ชื่อ: เอกสิทธิ์ พัชรวงศ์ศักดา
• การศึกษา:
• ปริญญาเอก วิทยาการคอมพิวเตอร์
สถาบันเทคโนโลยีนานาชาติสิรินธร (SIIT) มหาวิทยาลัยธรรมศาสตร์
• ปริญญาโท วิศวกรรมคอมพิวเตอร์ มหาวิทยาลัยเกษตรศาสตร์
• ปริญญาตรี วิศวกรรมคอมพิวเตอร์ มหาวิทยาลัยเกษตรศาสตร์
(เกียรตินิยมอันดับ 2)
• ประสบการณ์
• Certified RapidMiner Expert and Ambassador
• Research Collaboration with Western Digital (Thailand) เฟสที่ 1 ระยะเวลา 6 เดือน
• ร่วมวิจัย โครงการสํารวจข้อมูลเพื่อการวิเคราะห์พฤติกรรมของนักท่องเที่ยวเชิงลึก ด้วยวิธีการทําเหมือง
ข้อมูล การท่องเที่ยวแห่งประเทศไทย (ททท)
• วิทยากรอบรมการใช้งานซอฟต์แวร์ open source ทางด้าน data mining
2
27. http://dataminingtrend.com http://facebook.com/datacube.th
Data is a new oil
• “เมื่อข้อมูลมีค่าดั่งน้ำมัน”
27Source: https://www.raconteur.net/technology/drilling-for-new-oil-of-big-data
Customer
Data
Transaction
Data
Website
Data
Survey
Data
Social
Network
Data Employee
Data
Reports
Actionable
Insight
Predictive
Model
Recommended
Products
www.facebook.com/datacube.th
31. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Big Data ประกอบด้วย 3 V
• Volume
• ข้อมูลมีจำนวนเพิ่มขึ้นอย่างมหาศาล
• Velocity
• ข้อมูลเพิ่มขึ้นอย่างรวดเร็ว
• Variety
• ข้อมูลมีความหลากหลายมากขึ้น
31
source: https://upxacademy.com/beginners-guide-to-big-data/
35. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• ข้อมูลมีขนาดใหญ่มากๆ เช่น มีจำนวนเป็นพันล้านแถว (billion row) หรือ
เป็นล้านคอลัมน์ (million columns)
• Speed of new data creation and growth
• ข้อมูลเกิดขึ้นอย่างรวดเร็วมากๆ
35
37. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• ข้อมูลมีขนาดใหญ่มากๆ เช่น มีจำนวนเป็นพันล้านแถว (billion row) หรือ
เป็นล้านคอลัมน์ (million columns)
• Speed of new data creation and growth
• ข้อมูลเกิดขึ้นอย่างรวดเร็วมากๆ
• Complexity of data types and structures
• ข้อมูลมีความหลากหลาย ไม่ได้อยู่ในรูปแบบของตารางเท่านั้น อาจจะเป็น
รูปแบบของข้อความ (text) รูปภาพ (images) หรือ วิดีโอ (video clip)
37
54. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Map
• The input to this mapper would be strings that contain sentences,
and the mapper’s function would be to split the sentences into
words and output the words
54
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
63. http://dataminingtrend.com http://facebook.com/datacube.th
BI & Data Mining
63
Business
Intelligence
Data
Mining
Time
Analytical
Approach
Past Future
Explanatory
Exploratory
source:Data Science and Big Data Analytics: Discovering, analyzing, visualizing and presenting data
BI questions
• What happened last
quarter?
• How many unit sold?
• Where is the problem? In
which situations
Data Mining questions
• What if … ?
• What will happen next?
• Why is this happen?
www.facebook.com/datacube.th
64. http://dataminingtrend.com http://facebook.com/datacube.th
What is data mining
• “The exploration and analysis of large quantities
of data in order to discover meaningful patterns and
rules” – Data Mining Techniques (3rd Edition)
• เป็นการวิเคราะห์ข้อมูล เพื่อหารูปแบบ (patterns) หรือความสัมพันธ์
(relation) ระหว่างข้อมูลในฐานข้อมูลขนาดใหญ่
• “Extraction of interesting (non-trivial, previously,
unknown and potential useful) information from data in
large databases” – Data Mining Concepts &
Techniques (3rd Edition)
• เป็นกระบวนการดึงข่าวสารที่น่าสนใจ และมีประโยชน์แต่ไม่เคยรู้มา
ก่อนจากฐานข้อมูลขนาดใหญ่
64
image sources: https://binarylinks.wordpress.com/tag/data-mining/
http://www.amazon.com/Data-Mining-Techniques-Relationship-Management/dp/0470650931
65. http://dataminingtrend.com http://facebook.com/datacube.th
What is data mining
65
ข้อมูล' เทคนิคการทำ data mining' รูปแบบที่มีประโยชน์'
image source:http://www.computerrepairanaheim.net
https://sites.google.com/a/whps.org/diamond-teamkp/
http://meetings2.informs.org/wordpress/analytics2014/2014/04/01/why-oranalytics-people-need-to-know-about-database-technology/
85. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science/Data Mining methods
• ตัวอย่าง spam e-mail classification
• สร้างโมเดล (classification model) จากข้อมูล training data ซึ่งมีลาเบล (label)
85
ID Free Won Cash Type
1 Y Y Y spam
2 N Y Y spam
3 N N N normal
4 N N N normal
5 Y N N spam
6 Y N N spam
7 N N N normal
8 N Y N spam
9 N N N normal
10 N N N normal
attribute label
Free
Won
Normal Spam
Spam
classification model
= N = Y
= N = Y
training data
www.facebook.com/datacube.th
86. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science/Data Mining methods
• ตัวอย่าง spam e-mail classification
• นำข้อมูลใหม่ (unseen data) ทำนายโดยใช้โมเดล
86
attribute
Free
Won
Normal Spam
Spam
classification model
= N = Y
= N = Y
training data
ID Free Won Cash Type
11 Y Y N ?
12 N Y N ?
www.facebook.com/datacube.th
87. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science/Data Mining methods
• ตัวอย่าง spam e-mail classification
• นำข้อมูลใหม่ (unseen data) ทำนายโดยใช้โมเดล
87
attribute
Free
Won
Normal Spam
Spam
classification model
= N = Y
= N = Y
training data
ID Free Won Cash Type
11 Y Y N ?
12 N Y N ?
www.facebook.com/datacube.th
88. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science/Data Mining methods
• ตัวอย่าง spam e-mail classification
• นำข้อมูลใหม่ (unseen data) ทำนายโดยใช้โมเดล
88
attribute
Free
Won
Normal Spam
Spam
classification model
= N = Y
= N = Y
training data
ID Free Won Cash Type
11 Y Y N ?
12 N Y N ?
www.facebook.com/datacube.th
89. http://dataminingtrend.com http://facebook.com/datacube.th
• ตัวอย่าง spam e-mail classification
ID Free Won Cash Type
1 Y Y Y spam
2 N Y Y spam
3 N N N normal
4 N N N normal
5 Y N N spam
Classification example
89
attribute labelID
training data
สร้าง classification model
ID Free Won Cash Type
11 Y Y N ?
12 N Y N ?
unseen data
classification model
ID Type
11 spam
12 spam
1
2
3 4
www.facebook.com/datacube.th
90. http://dataminingtrend.com http://facebook.com/datacube.th
• แนะนำข้อมูลขนาดใหญ่ (Big Data)
• แนะนำเทคโนโลยี Internet of Things (IoT)
• แนะนำเทคโนโลยีการจัดการข้อมูลขนาดใหญ่
• แนะนำเทคนิคการวิเคราะห์ข้อมูลด้วยดาต้า ไมน์นิง
• ตัวอย่างการประยุกต์ใช้งาน
• กระบวนการมาตรฐานในการวิเคราะห์ข้อมูลด้วยเทคนิคดาต้า ไมน์นิง
หัวข้อการบรรยาย
90
102. http://dataminingtrend.com http://facebook.com/datacube.th
Big Data & Analytics Applications
• คาดการณ์การลาออกของพนักงาน
102
Receive Promotion
= NO = YES
Years with firm < 5
Not Quit
= YES = NO
Partner changed job
Quit Not Quit
= YES = NO
Quit
ตัวอย่างโมเดลคาดการณ์การลาออกของพนักงาน
108. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science Use cases
108
• Churn Prevention
• Identify customers likely to leave, take
preventative action.
• Customer Lifetime Value
• Distinguish between customers based on
business value.
• Customer Segmentation
• Create meaningful customer groups for
more relevant interactions.
• Demand Forecasting
• Know what volumes to expect to improve
planning.
• Fraud Detection
• Identify fraudulent activity quickly, and
end it.
• Next Best Action
• The right action at the right time for the
right customer.
• Price Optimization
• Set prices that balance demand, profit,
and risk.
• Product Propensity
• Predict what your customers will buy,
before even they know it.
109. http://dataminingtrend.com http://facebook.com/datacube.th
Data Science Use cases
109
• Text Mining
• Extract insight from unstructured content.
• Next Best Action
• The right action at the right time for the
right customer.
• Price Optimization
• Set prices that balance demand, profit,
and risk.
• Product Propensity
• Predict what your customers will buy,
before even they know it.
• Quality Assurance
• Resolve quality issues before they
become a problem.
• Predictive Maintenance
• Predict equipment failure, plan cost-
effective maintenance.
• Risk Management
• Understand risk to manage it.
• Up- and Cross-Selling
• Convince customers to buy more.
112. http://dataminingtrend.com http://facebook.com/datacube.th
• แนะนำข้อมูลขนาดใหญ่ (Big Data)
• แนะนำเทคโนโลยี Internet of Things (IoT)
• แนะนำเทคโนโลยีการจัดการข้อมูลขนาดใหญ่
• แนะนำเทคนิคการวิเคราะห์ข้อมูลด้วยดาต้า ไมน์นิง
• ตัวอย่างการประยุกต์ใช้งาน
• กระบวนการมาตรฐานในการวิเคราะห์ข้อมูลด้วยเทคนิคดาต้า
ไมน์นิง
หัวข้อการบรรยาย
112
113. http://dataminingtrend.com http://facebook.com/datacube.th
How to start?
113
ow To Start?
Implementation
•Full-scale
development
•Prepare for
deployment
Build
competency
•Identify key areas
•Talent
development
Pilot Projects
•Evaluate vendors
•Build confidence
•Validate ROI
DOMAIN KNOWLEDGE
Business issues
•What could be
improved?
•List them down
Prioritize
•How critical?
•What happens if
not solved?
Feasibility
•What’s the
available data?
•Potential ROI
114. http://dataminingtrend.com http://facebook.com/datacube.th
The Journey of Data Driven Insights
114
Data Insight Action
The Journey of Data Driven InsightsAcquire
Streaming/In
Motion Data
Staging/At Rest
Data
OrganizeData Lake / Data
Warehouse
Data
Governance
Metadata
Management
Analyze
Advanced
Analytics
Machine
Learning
Deliver
Reporting
Dashboard
API
Domain Knowledge
115. http://dataminingtrend.com http://facebook.com/datacube.th
Data Analytics Framework
115
Data Sources
Web & Social
Existing
Systems
Clickstream
Geolocation
Sensor &
Machine
Open Data
Unstructured
Data Quality &
Business Rules
Data
Stewardship
Data Integration &
Data Quality
Data
Ingestion
ETL/ELT
Data Lake & Data
Warehouse
Hadoop:
- Distributed data store
- In-Hadoop processing
- Easy to scale
Data Synchronization
BI &
Data Science
Development Environment
Deployment Environment
- Scheduler
- Web services
- Interactive web apps
Operationalize
Batch
Stream
Real-Time
Data Governance & Security
Master Data Management (MDM)
Perform Advanced
Analytics in-Hadoop
without coding
Single Source of Truth
Warehouses/
Marts
- RDBMS
- NoSQL
Data
Processing
Acquire Organize Analyze
Deliver
130. http://dataminingtrend.com http://facebook.com/datacube.th
References
• Andrew Chisholm, Exploring Data with RapidMiner, November 2013
• Markus Hofmann, Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and
Business Analytics Applications, October 25, 2013
• Foster Provost, Data Science for Business: What you need to know about
data mining and data-analytic thinking, August 19, 2013
• Eakasit Pacharawongsakda, An Introduction to Data Mining Techniques (Thai
version), 2014
130
131. http://dataminingtrend.com http://facebook.com/datacube.th
For more information
• หสม. ดาต้า คิวบ์ (data cube)
• website: http://www.dataminingtrend.com
• facebook: http:facebook.com/datacube.th หรือ http://facebook.com/sit.ake
• email: eakasit@datacube.asia
• lineID: eakasitp
131