class: center, middle, inverse, title-slide

.title[
# Automatic Sampling and Analysis of YouTube Data
]
.subtitle[
## Tools for collecting YouTube data
]
.author[
### Johannes Breuer, Annika Deubel, & M. Rohangis Mohseni
]
.date[
### February 14th, 2023
]

---
layout: true
---

## How to Collect *YouTube* Data

There are many different ways in which data from *YouTube* and other social media can be collected (see [Breuer et al., 2020](https://journals.sagepub.com/doi/10.1177/1461444820924622)):

- Manually (e.g., via copy & paste and manual content analysis)

- Using existing data, such as [*YouNiverse: Large-Scale Channel and Video Metadata from English YouTube*](https://zenodo.org/record/4650046) (also see the accompanying preprint by [Ribeiro & West, 2021](https://arxiv.org/abs/2012.10378))

- Automatically via the *YouTube* API or web scraping

---

## Identifying Relevant Channels or Videos

If new data is collected, it is necessary to identify relevant channels and videos for the sample.

- [VTracker](https://vtracker.host.ualr.edu/)

- [Socialblade](https://socialblade.com/)

- [YouTube Channel Crawler](https://channelcrawler.com/)

---

## VTracker

- Searching for and tracking videos

- Basic analyses such as engagement, keyword trends, and influence detection

- Creation of dashboards for different metrics

- Data cannot be collected for further analysis

- Still a bit buggy

---

<img src="data:image/png;base64,#../img/vtracker.png" width="60%" style="display: block; margin: auto;" />

---

## Socialblade

- Ranked lists of channels

- Useful if there are no content-related criteria for channel selection

<img src="data:image/png;base64,#../img/socialblade.png" width="80%" style="display: block; margin: auto;" />

---

## YouTube Channel Crawler

- Search for channels with the help of filters (e.g.,
language, likes)

- Useful if there are no content-related criteria for channel selection

<img src="data:image/png;base64,#../img/channelcrawler.png" width="80%" style="display: block; margin: auto;" />

---

## Excluding Problematic Channels

- [YouTube Wiki](https://youtube.fandom.com/de/wiki/YouTube_Wiki)

- Social background information on channels (only in German)

- Useful to identify reasons for exclusion (e.g., conflicts between channels)

Once the relevant channels have been identified and potentially problematic channels have been excluded, the next step is to sample the comments. Some of the comment sampling tools also offer search functions that can be used in addition to or instead of the tools mentioned above.

---

## Comparisons of Approaches for Collecting *YouTube* Data

.small[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Software </th>
   <th style="text-align:center;"> Type </th>
   <th style="text-align:center;"> Can collect </th>
   <th style="text-align:center;"> Comment Scope </th>
   <th style="text-align:center;"> Needs API Key </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> YouTube Data Tools 1.22 </td>
   <td style="text-align:center;"> Website </td>
   <td style="text-align:center;"> Channel Info, Video Info, Comments </td>
   <td style="text-align:center;"> x top-level or all </td>
   <td style="text-align:center;"> No </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Webometric 4.3 </td>
   <td style="text-align:center;"> Standalone app </td>
   <td style="text-align:center;"> Channel Info, Video Info, Comments, Video Search </td>
   <td style="text-align:center;"> 100 most recent or all </td>
   <td style="text-align:center;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Tuber 0.9.9 </td>
   <td style="text-align:center;"> R package </td>
   <td style="text-align:center;"> Channel Info, Video Info, Comments, Subtitles, All searches </td>
   <td style="text-align:center;"> 20-100 most recent or all </td>
   <td style="text-align:center;"> Yes </td>
  </tr>
  <tr>
   <td
style="text-align:center;"> vosonSML 0.29.13 </td>
   <td style="text-align:center;"> R package </td>
   <td style="text-align:center;"> Video IDs, Comments </td>
   <td style="text-align:center;"> 1-x top-level </td>
   <td style="text-align:center;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:center;"> youtubecaption 1.0.0 </td>
   <td style="text-align:center;"> R package </td>
   <td style="text-align:center;"> Subtitles </td>
   <td style="text-align:center;"> n/a </td>
   <td style="text-align:center;"> No </td>
  </tr>
</tbody>
</table>
]

---

## YouTube Data Tools

[YouTube Data Tools](https://tools.digitalmethods.net/netvizz/youtube/)

<img src="data:image/png;base64,#../img/ytdt.png" width="1656" style="display: block; margin: auto;" />

---

## Webometric

[Webometric 4.3](http://lexiurl.wlv.ac.uk/searcher/youtube.html)

<img src="data:image/png;base64,#../img/webometric.png" width="90%" style="display: block; margin: auto;" />

---

## Exemplary Comparison of the Different Tools

.small[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Software </th>
   <th style="text-align:center;"> Ease of Use </th>
   <th style="text-align:center;"> Disadvantages </th>
   <th style="text-align:center;"> No.
of Comments </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> YouTube Data Tools 1.30 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> Lacking flexibility, less information </td>
   <td style="text-align:center;"> 54,850 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Webometric 4.1 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> Only first 5 follow-up comments, no error feedback, undetectable time-outs </td>
   <td style="text-align:center;"> 51,095 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Tuber 0.9.9 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> Only first 5 follow-up comments </td>
   <td style="text-align:center;"> 51,084 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> vosonSML 0.29.13 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> Lacking flexibility, only comments </td>
   <td style="text-align:center;"> 52,679 </td>
  </tr>
</tbody>
</table>
]

Example data source: [Dayum Video](https://www.youtube.com/watch?v=DcJFdCmN98s)

---

## A Note on Using FOSS

The tools listed are free and open source software (FOSS). Using FOSS has many advantages (availability, adaptability, etc.). However, one risk associated with using FOSS is that tools may no longer be maintained and may cease to function. After all, people create and maintain these tools in their spare time or as side projects, and this work is often not recognized enough (especially within academia). For this reason, it is important to acknowledge the work that goes into these tools by properly citing them.

.small[

```r
citation("tuber")
```

```
## 
## To cite package 'tuber' in publications use:
## 
##   Gaurav Sood (2020). tuber: Access YouTube from R. R package version 0.9.9.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {tuber: Access YouTube from R},
##     author = {Gaurav Sood},
##     year = {2020},
##     note = {R package version 0.9.9},
##   }
```
]

---
class: center, middle

# Any questions so far?