Webinar by Oleksandr Fedirko, CEE Head of BigData Practice at GlobalLogic
Marketing things are boring, aren’t they?
Starting points
Challenges on a project
Next steps and evolution
Conclusions
Confidential
What is marketing about?
● Research what we buy
● Figure out purchase behavior
● Target the ad audience better
● Help adjust ad campaigns
How the advertising business works
[Diagram: Buyers (Brand, Agency, DSP, Ad Network) and Sellers (Ad Network, SSP, Publisher, Audience), connected through the Ad Exchange with RTB; DMP/Data Supply]
Business scenario
Figure out how advertising (online and offline impressions) leads to a
purchase.
Example: I saw an ad on my cell phone, then on my laptop, then on a printed
coupon, and I bought the promoted item.
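The scenario above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: the record types, the field names and the device-to-person identity map are hypothetical and do not come from the project. It assumes that every impression carries a device (or coupon) identifier that an identity map can resolve to a person.

```scala
// Hypothetical record types; none of these names come from the project.
final case class Impression(deviceId: String, channel: String, timestamp: Long)
final case class Purchase(personId: String, item: String, timestamp: Long)

object AttributionSketch {
  // Returns the impressions that reached the buyer on any of their
  // devices before the purchase, ordered by time.
  def impressionsBefore(
      purchase: Purchase,
      impressions: Seq[Impression],
      deviceToPerson: Map[String, String]
  ): Seq[Impression] =
    impressions
      // keep impressions whose device resolves to the buyer
      .filter(i => deviceToPerson.get(i.deviceId).contains(purchase.personId))
      // keep only what was seen before the purchase
      .filter(_.timestamp < purchase.timestamp)
      .sortBy(_.timestamp)
}
```

For the example on the slide, the phone, laptop and coupon impressions would all resolve to the same person and be returned in the order they were seen, while impressions of other people or after the purchase would be dropped.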
Privacy concerns
Q: Do such companies spy on me?
A: In some way yes, but you agreed to share this information about yourself.
Read the EULA.
Q: Do they want to steal my private data?
A: No, they don't want to.
Q: Do they have my bank account numbers or credit card numbers?
A: No, they don't.
Q: Do they buy information about me from other companies?
A: Yes, they do.
Starting points
BigData platform in Microsoft Azure IaaS
- Cloudbreak-based setup
- HDP services managed by Ambari
- No unit tests
- Extensive use of NiFi
- Scala-based Spark jobs
- No CI/CD
[Diagram: Old data platform and Our data platform. Labels: AdExchange1, AdExchange2, AdExchange3, AdExchange4, Kafka, MDM, Identities, Impressions, Store orders and impressions, Matched impressions, Analytics, Setup campaign]
[Diagram: Starting points. Labels: Data sources, Ingestion, DataLake (Raw, Refined, Target, Presentation), Indexing, UI, Orchestration, Transformation, Business scenario 1, Business scenario 2]
Challenges on a project
No unit-tests
Let’s make it!
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode
import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
test("load id map table data") {
  // given: seed the Hive table with known rows
  val expectedData = List(
    MyMap(id1 = "GUID1", id2 = "GUID2", id_type = "aaid").......
  )
  val expected = spark.createDataFrame(sc.parallelize(expectedData))
  expected.write
    .mode(SaveMode.Append)
    .format("hive")
    .partitionBy("id_type")
    .saveAsTable(Db.MySchema.name + "." + Db.MySchema.Table.MyTable)
  // when: run the function under test against that table
  val actual = MainClass.functionToTest(spark, Db.MySchema.name,
    Db.MySchema.Table.MyTable)
  // then: the schema and the rows must match what was seeded
  val actualFieldsQueried = actual.schema.fields.map(f => f.name)
  withClue("Actual fields queried:\n" + actual.schema.treeString) {
    actualFieldsQueried shouldEqual Array("id1", "id2", "id_type")
  }
  val actualData = actual.collect()
  withClue(actualData.mkString("\n", "\n", "\n")) {
    actualData.length should equal(expectedData.size)
    withClue("Actual id2 field differs from expected") {
      // compare against the seeded case classes, not the DataFrame
      val actualId2 = actualData.map(r => r.getAs[String]("id2"))
      actualId2 should contain theSameElementsAs expectedData.map(_.id2)
    }
  }
}
Challenges on a project
One job runs for 4 hours and takes all the resources of the cluster
The job has to analyze the last 52 weeks of order history
We can make it incremental!
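One way to picture the incremental idea is a rolling window over weekly aggregates. The sketch below is plain Scala, not the project's Spark job. It assumes the analysis can be expressed as a sum over per-week totals, which the slides do not confirm, and the names WeeklyTotal and RollingWindow are hypothetical. Each run then adds the newest week and drops the week that left the window instead of rescanning all 52 weeks.

```scala
import scala.collection.immutable.Queue

// Hypothetical types for illustration; not taken from the project code.
final case class WeeklyTotal(week: Int, orders: Long)

// Rolling state: the per-week totals inside the window and their running sum.
final case class RollingWindow(size: Int, weeks: Queue[WeeklyTotal], sum: Long) {
  // Adds the newest week and evicts the oldest one once the window is full.
  def add(next: WeeklyTotal): RollingWindow = {
    val appended = weeks.enqueue(next)
    if (appended.size > size) {
      val (oldest, rest) = appended.dequeue
      RollingWindow(size, rest, sum + next.orders - oldest.orders)
    } else {
      RollingWindow(size, appended, sum + next.orders)
    }
  }
}

object RollingWindow {
  // Starts with no history; the window fills up as weeks are added.
  def empty(size: Int): RollingWindow = RollingWindow(size, Queue.empty, 0L)
}
```

With this shape, a weekly run only has to aggregate one new week of orders, so its cost no longer grows with the length of the history. Metrics that cannot be updated by adding and subtracting (distinct counts, for example) would need a different state representation.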
Challenges on a project
No automated rollout of the cluster
- Setting up a new cluster takes about 7-10 days
- Cloudbreak blueprints do not help much
Challenges on a project
No CI/CD
- Automate at least the build, unit tests and deployment of the jars
- CI/CD of the Oozie scripts is only partially covered
- CD for the shell scripts
Challenges on a project
Low level of security
- Every developer uses the sa account to get to the edge node
- Single admin user for Ambari
- Developers have access to PROD
Challenges on a project
Stuck with Kafka 0.7
The customer has locked itself into this old version of Kafka
The only option to consume messages is to use the old Java library
Next steps and evolution
Tactical:
- NiFi for stream ingestion only
- Move to Airflow
- Move to Databricks
Strategic:
- Get rid of Kafka 0.7
- Switch to the Next Generation platform
Next steps and evolution
What is the Next Generation platform?
A PaaS-based data platform fully managed by Azure
- Azure Data Factory
- Azure Databricks
- Azure Data Lake
- Azure Event Hubs
Conclusions
- DevOps, QA, L2 and InfoSec are your best friends
- There is high demand from companies to move to PaaS
- Don't leave security for last
- General programming practices are for everyone, not just for Software
Engineers
- Automate everything that you can
- Documentation first