00:01
Hello students, Hadoop provides various tools and technologies to facilitate the ingestion of streaming data.
00:08
One popular approach is to use Apache Kafka as the streaming data platform, in conjunction with Hadoop components like HDFS and Apache Spark.
00:22
Here is a high-level overview of how streaming data can be ingested into a Hadoop cluster.
00:31
The first step is to choose a streaming platform.
00:38
Choose a streaming platform.
00:45
Apache Kafka is a widely used streaming platform that provides reliable, scalable, and distributed message processing.
00:59
It acts as a buffer between the data sources and the Hadoop components, ensuring data durability and availability.
01:08
The next step is to set up Kafka.
01:15
So install and configure Apache Kafka on a dedicated cluster or set of servers.
01:22
Kafka consists of producers that generate the data streams, brokers that store and distribute the data, and consumers that process the data.
01:35
So the next step is to create the Kafka topics.
01:45
Topics are logical channels for organizing the data streams. Producers publish data to specific topics, and consumers subscribe to those topics to consume the data.
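To make the publish and subscribe idea concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka itself and not a Kafka client: `ToyBroker`, the topic names, and the `hadoop-loader` group name are all hypothetical, chosen only for illustration. It shows records being appended to a named topic and each consumer group reading from its own position in that topic.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic.

    Hypothetical teaching model; a real Kafka broker is a separate
    networked service with persistent, replicated storage.
    """

    def __init__(self):
        self.logs = defaultdict(list)    # topic name -> list of records
        self.offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, record):
        """Producer side: append a record to the topic's log, return its offset."""
        self.logs[topic].append(record)
        return len(self.logs[topic]) - 1

    def poll(self, group, topic, max_records=10):
        """Consumer side: return unread records for this group, advance its offset."""
        start = self.offsets[(group, topic)]
        batch = self.logs[topic][start:start + max_records]
        self.offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.publish("sensor-readings", {"sensor": "s1", "temp": 21.5})
broker.publish("sensor-readings", {"sensor": "s2", "temp": 19.0})
broker.publish("app-logs", {"level": "INFO", "msg": "started"})

# The subscriber only sees records from the topic it reads.
first_poll = broker.poll("hadoop-loader", "sensor-readings")
# A second poll returns nothing new, because the group's offset has advanced.
second_poll = broker.poll("hadoop-loader", "sensor-readings")
print(first_poll)
print(second_poll)
```

In this sketch a different group name would start again from the beginning of the topic, which mirrors how independent consumers can each read the same stream.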
02:00
The next step is to produce the data.
02:07
So data sources such as sensors, applications, and external systems generate the streaming data and send it to the Kafka topics using producers.
02:17
Kafka allows high-throughput data publishing.
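As a rough sketch of the producer side, the snippet below encodes sensor readings as key and value bytes, which is the general shape of a message sent to a topic. The function names, the JSON field names, and the `sensor-readings` topic are assumptions made for this example. The `send` argument is a hypothetical stand-in for whatever send call your Kafka client library provides, so the code runs without a Kafka installation.

```python
import json


def encode_reading(sensor_id, value, timestamp_ms):
    """Turn one sensor reading into a (key, value) pair of bytes."""
    key = sensor_id.encode("utf-8")
    payload = {"sensor_id": sensor_id, "value": value, "ts": timestamp_ms}
    return key, json.dumps(payload, sort_keys=True).encode("utf-8")


def produce(readings, send, topic="sensor-readings"):
    """Encode each reading and pass it to `send`; return how many were sent.

    `send(topic, key, value)` is a placeholder for a real client's send call.
    """
    count = 0
    for sensor_id, value, ts in readings:
        key, value_bytes = encode_reading(sensor_id, value, ts)
        send(topic, key, value_bytes)
        count += 1
    return count


sent = []  # collects what a real producer would hand to the broker
n = produce(
    [("s1", 21.5, 1700000000000), ("s2", 19.0, 1700000000500)],
    send=lambda topic, key, value: sent.append((topic, key, value)),
)
print(n, sent[0])
```

Using the sensor ID as the key is a design choice in this sketch; it keeps all readings from one sensor identifiable as a group on the consuming side.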
02:22
Once the data is published, the next step is to use Kafka Connect.
02:28
Kafka Connect is a framework for connecting external systems with Kafka topics...
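As an illustration only, the sketch below builds the JSON body that could be posted to the Kafka Connect REST interface to register a connector that copies a topic into HDFS. It assumes the Confluent HDFS sink connector is installed on the Connect workers, which the lecture does not state. The connector name, topic, and `hdfs://namenode:8020` address are hypothetical placeholders. The code only builds and prints the request body; it does not contact any server.

```python
import json


def hdfs_sink_request(name, topic, hdfs_url, flush_size=1000):
    """Build a connector registration body for a topic-to-HDFS sink.

    Assumes the Confluent HDFS sink connector; check the property names
    against the documentation for the connector version you install.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "hdfs.url": hdfs_url,
            # Number of records written to a file before it is committed.
            "flush.size": str(flush_size),
        },
    }


payload = hdfs_sink_request(
    name="hdfs-sink-sensor-readings",   # hypothetical connector name
    topic="sensor-readings",            # hypothetical topic
    hdfs_url="hdfs://namenode:8020",    # hypothetical NameNode address
)
print(json.dumps(payload, indent=2))
```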