Data Crawling Everything: Case of Social Media

SelectStar
9 min read · Jul 9, 2020


Introduction

In this tutorial, we are going to talk about data scraping and how it can be used for various purposes. Data scraping is the process of extracting information from a website and saving it to a spreadsheet or a local file on your computer. We will explore how to scrape data from social media sites, particularly Twitter, and then do the same for Reddit using its official API. Lastly, we will learn how to scrape the text content of web pages in general, as well as images from different web pages. So, let’s get started!

Prerequisites

Before you go ahead, please note that there are a few prerequisites for this tutorial. You should have some basic prior knowledge of machine learning, as well as basic programming knowledge in any language (preferably Python). We will be using Jupyter Notebook for writing our code. If you do not already have it installed, visit the Jupyter Notebook website, or work in any other code editor of your liking.

1. Scraping tweets from Twitter using Twint

There are a number of ways to scrape tweets from Twitter. You can do so using the Twitter API, but a shortcoming of this is that it limits the number of tweets that can be scraped. Copying tweets manually is also an option, but it requires unnecessary time and effort. This is why we will be using Twint to collect our tweets from Twitter. Twint is a tool that allows you to scrape tweets on different bases, e.g. the tweets of a particular user, tweets containing a particular keyword, or tweets posted after or within a certain time window.

Installations

You can install Twint by typing the following command in your terminal:

pip install twint

Scraping Twitter tweets using Twint

Scraping tweets of a particular user

import twint
config = twint.Config()
# Search tweets tweeted by user 'BarackObama'
config.Username = "BarackObama"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after Jan 1st, 2020
config.Since = "2020-01-01 20:30:15"
# Formatting the tweets
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Storing tweets in a csv file
config.Store_csv = True
config.Output = "Barack Obama"
twint.run.Search(config)

Output:

Tweet Id 1261004586359422979, tweeted at 18:44:56, 2020-05-14, by BarackObama says: Vote.
Tweet Id 1260955716644470784, tweeted at 15:30:44, 2020-05-14, by BarackObama says: Michelle and I want to do our part to give all you parents a break today, so we’re reading “The Word Collector” for @chipublib. It’s a fun book that vividly illustrates the transformative power of words––and we hope you enjoy it as much as we did. pic.twitter.com/ADYbL6Dzg4
Tweet Id 1260707691900612615, tweeted at 23:05:11, 2020-05-13, by BarackObama says: Despite all the time that’s been lost, we can still make real progress against the virus, protect people from the economic fallout, and more safely approach something closer to normal if we start making better policy decisions now. https://www.vox.com/2020/5/13/21248157/testing-quarantine-masks-stimulus …
....
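Since config.Store_csv is set, the tweets are also written to disk. As a quick, hedged sketch (the exact output location depends on your Twint version and the config.Output value, so adjust the path as needed), you can load them back with pandas:

import pandas as pd

# Assumption: Twint wrote its CSV to the path given in config.Output above;
# some versions write a file at that exact path, others create a folder.
tweets_df = pd.read_csv("Barack Obama")
# Column names can vary slightly between Twint versions
print(tweets_df[["id", "date", "time", "username", "tweet"]].head())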

Scraping tweets with a particular keyword

import twint
# Configure
config = twint.Config()
# Search tweets that mention Taylor Swift
config.Search = "taylor swift"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after Jan 1st, 2020
config.Since = "2020-01-01 20:30:15"
# Formatting the tweets
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Storing tweets in a csv file
config.Store_csv = True
config.Output = "Taylor Swift"
twint.run.Search(config)

Output:

Tweet Id 1261267734861619201, tweeted at 12:10:35, 2020-05-15, by CVirginie4 says: In 2 days 😍😍💖@taylorswift13 @taylornation13 💖#TaylorSwift #Lover #TaylorSwiftCityOfLover https://twitter.com/cvirginie4/status/1259162021238657024 …
Tweet Id 1261267727961821184, tweeted at 12:10:34, 2020-05-15, by allebahsia says: i met somebody while it was raining and since then everything has changed wow taylor swift the soundtrack to my life huh https://twitter.com/MissAmericHANA/status/1260628201237147648 …
Tweet Id 1261267719145447424, tweeted at 12:10:32, 2020-05-15, by KenHatesU says: Or "Pancake"
Tweet Id 1261267707250577409, tweeted at 12:10:29, 2020-05-15, by Butera_Maraj says: Vodk il a percé au states jusqu’à il côtoie Taylor Swift https://twitter.com/taylorswift13/status/1261054225867472897 …
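Twint can also hand the scraped tweets to a pandas DataFrame directly instead of (or in addition to) writing a CSV. A minimal sketch, assuming a Twint version that supports the Pandas storage option:

import twint

config = twint.Config()
config.Search = "taylor swift"
config.Limit = 20
config.Pandas = True        # store results in an in-memory pandas DataFrame
config.Hide_output = True   # don't print every tweet to the console

twint.run.Search(config)

# Twint exposes the collected tweets as a module-level DataFrame
tweets_df = twint.storage.panda.Tweets_df
print(tweets_df[["date", "username", "tweet"]].head())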

2. Scraping Reddit using the Reddit API

We will be scraping donation requests made on Reddit by using the official Reddit API. To access it, you need to:

  1. Go to the official Reddit website
  2. Log into your Reddit account or create a new one
  3. Go to User Settings
  4. Go to Privacy and Security
  5. Go to App authorization
  6. Click on ‘are you a developer? create an app’
  7. Give your application a name and fill in the other required fields. In the redirect URI field, enter the URL of your localhost.
  8. Click on ‘create app’
  9. Copy the string underneath ‘personal use script’ and the one next to ‘secret’ and save them in a file or notepad. You will need them to gain access to the API.

Installations

We will be using a Python package named PRAW (the Python Reddit API Wrapper) to easily use the Reddit API. To install it, run the following command in your terminal:

pip install praw
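Before running the full script in the next section, you can sanity-check the credentials you saved earlier with a minimal read-only session (the strings below are placeholders):

import praw

# Paste the 'personal use script' string as client_id, the 'secret' string as
# client_secret, and use your application's name as the user_agent.
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="YOUR_APP_NAME")

# Without a username/password the session is read-only, which is all we
# need for scraping public posts.
print(reddit.read_only)  # expected: True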

Python Code

import praw
import pandas as pd

# Fill in your own credentials: the characters in 'personal use script' are your
# client_id, the characters in 'secret' are your client_secret, and user_agent
# is the name of your application.
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     user_agent='')

# Subreddits related to donations
subreddits = ['donate',
              'Assistance',       # also contains offers
              'Charity',
              'Donation',
              'gofundme',         # lots of categories
              'RandomKindness',
              'donationrequest']

# Collect the 10 hottest posts from each subreddit
posts = []
for name in subreddits:
    for post in reddit.subreddit(name).hot(limit=10):
        posts.append([post.title, post.score, post.id, post.subreddit, post.url,
                      post.num_comments, post.selftext, post.created])

df = pd.DataFrame(posts, columns=['title', 'score', 'id', 'subreddit', 'url',
                                  'num_comments', 'body', 'created'])

# Data processing: keep only the columns we need
df = df[['title', 'score', 'body']]
print(df.head())
print(df.shape)

# Saving donation posts to a csv file
df.to_csv('donations.csv', index=False)

Output: the first rows of the scraped donation posts (title, score, body) and the shape of the DataFrame, as printed by df.head() and df.shape.
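Scraped posts are rarely clean. As a small illustration (not part of the original script), you could drop empty bodies and duplicate posts before using donations.csv further:

import pandas as pd

df = pd.read_csv("donations.csv")

# Remove posts with no body text and exact duplicates (e.g. reposts)
df = df.dropna(subset=["body"])
df = df[df["body"].str.strip() != ""]
df = df.drop_duplicates(subset=["title", "body"])

print(df.shape)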

3. Scraping contents of a web page

We will be scraping the text content of a Wikipedia page (in this example, the article on coronaviruses) using a simple and powerful Python library named BeautifulSoup. It is also important for you to be familiar with some basics of HTML for web scraping. First, right-click and open your browser’s inspector to inspect the webpage. Hover your cursor over the section whose content you want to scrape, and you should see a blue box surrounding it. If you click it, the corresponding HTML will be highlighted in the browser console. The section that we wish to scrape is a div that contains the entire text of the page.

Installations

To install BeautifulSoup, run the following command in your terminal:

pip install beautifulsoup4

Python Code

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url of the webpage whose content you need to scrape
url = "https://en.wikipedia.org/wiki/Coronavirus"
request = urllib.request.Request(url)
# query the website and return the html of the webpage
response = urllib.request.urlopen(request)
# parse the html using BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')
# find the <div> that holds the article body and extract its text
text_box = soup.find('div', attrs={'id': 'bodyContent'})
text = text_box.text.strip()
print(text)

Output:

From Wikipedia, the free encyclopediaJump to navigation
Jump to search
This article is about the group of viruses. For the ongoing disease involved in the COVID-19 pandemic, see Coronavirus disease 2019. For the virus that causes this disease, see Severe acute respiratory syndrome coronavirus 2.
Subfamily of viruses in the family CoronaviridaeOrthocoronavirinaeTransmission electron micrograph (TEM) of avian infectious bronchitis virusIllustration of the morphology of coronaviruses; the club-shaped viral spike peplomers, colored red, create the look of a corona surrounding the virion when observed with an electron microscope.
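If you only need the article paragraphs rather than everything inside the content div, a small variation of the script above (a sketch that reuses the text_box element already found) is:

# Collect only the <p> elements inside the main content div
paragraphs = text_box.find_all("p")
article_text = "\n".join(p.text.strip() for p in paragraphs)
print(article_text[:500])  # preview the first 500 characters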

4. Scraping images

We will be scraping images in batch using the Fatkun Batch Download Image browser extension (a code-based alternative is sketched at the end of this section).

Prerequisites

You will need the Google Chrome browser along with the Fatkun Batch Download Image extension.

Steps:

  1. After you have finished the installation, open the website containing the pictures that you want to download
  2. Click on the extension’s icon
  3. The extension opens a new tab that displays all the images it has detected on the page. By default, every picture shown in this tab is selected for download; adjust the selection if needed and then click on ‘save image’.
  4. The extension will warn you that the browser may ask where to save each file before it is downloaded, so you may have to confirm each image.
  5. The extension creates a new folder named after the title of the website and downloads all the selected images into it. You can also click on ‘more options’ to filter the images by link, rename them, or sort them by size.
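If you prefer a code-based route instead of a browser extension, here is a minimal sketch that reuses the BeautifulSoup setup from the previous section. The URL and output folder are placeholders, and real pages often use relative or lazy-loaded image URLs (or block the default Python user agent), so treat this as a starting point rather than a drop-in solution:

import os
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Coronavirus"  # placeholder page
out_dir = "images"
os.makedirs(out_dir, exist_ok=True)

headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default Python user agent
html = urllib.request.urlopen(urllib.request.Request(url, headers=headers)).read()
soup = BeautifulSoup(html, "html.parser")

# Download every <img> tag that has a src attribute
for i, img in enumerate(soup.find_all("img", src=True)):
    img_url = urljoin(url, img["src"])            # resolve relative / protocol-relative URLs
    ext = os.path.splitext(img_url)[1] or ".jpg"  # fall back to .jpg if no extension
    try:
        request = urllib.request.Request(img_url, headers=headers)
        with urllib.request.urlopen(request) as resp:
            with open(os.path.join(out_dir, f"img_{i}{ext}"), "wb") as f:
                f.write(resp.read())
    except Exception as err:
        print(f"Skipped {img_url}: {err}")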

Conclusion

To sum it all up, we started off with an introduction to data scraping, or data crawling. We then applied it in four different ways: we extracted tweets from Twitter without using its API, scraped posts from subreddits using the official Reddit API, scraped content from web pages using Python’s BeautifulSoup library, and lastly downloaded images in batch using the Fatkun Batch Download Image Chrome extension.

Is Crawling Hard? Let Us HELP!

While crawling gives easy access to many web-based data collections, such data usually comes with heavy noise and contamination and can rarely be used as a dataset right away. Companies and researchers therefore need to devote considerable effort to quality control, and securing enough human resources for this is always a great challenge. It is often more efficient to find a service that does the laborious work, including both collection and preprocessing, for you. For that, we could be your perfect solution!

Here at SelectStar, we crowdsource our tasks to a diverse set of users located around the globe to ensure both quality and quantity. Moreover, our in-house managers double-check the quality of the collected and processed data. Check us out at selectstar.ai for more information!

